Chapter 1

Introduction 



1.1 A networked description of Nature and Society 

The recent interest of a wide interdisciplinary scientific community in the study of complex networks
is justified primarily by the fact that a network description of complex systems makes it possible to
extract relevant information by means of purely statistical, coarse-grained analyses, without taking
into account the detailed characterization of the system. Moreover, using an abstract networked
representation, it is possible to compare, in the same framework, systems that are originally very
different, so that the identification of universal properties becomes much easier. Simplicity and
universality are two fundamental principles of physical research, in particular of statistical physics,
which is traditionally interested in the emergence of collective phenomena in systems of many
interacting particles, even outside its classical fields of research such as condensed matter theory.
Over the last century, statistical physicists have developed a suite of analytical and numerical
techniques that made it possible to understand the origin of phase transitions and critical phenomena
in many-particle systems, and that have also been applied successfully in other fields, from computer
science (e.g. optimization problems) to biology (e.g. protein folding) and the social sciences
(e.g. opinion formation models). The presence of disorder, randomness and heterogeneity is the
other important ingredient that justifies the use of statistical physics approaches in so many different
fields, and in particular in the study of complex networks, which present non-trivial, irregular
topological structures.

From a mathematical point of view, complex networks are sets of many interacting components, the
nodes, whose collective behavior is complex in the sense that it cannot be directly predicted and
characterized in terms of the behavior of the individual components. The links connecting pairs of
components correspond to the interactions that are responsible for the global behavior of the system.
Clearly, a large number of systems can be described in this manner, so it is not difficult
to find disparate examples of networks in both nature and society. The most evident application of
network theory is the study of the Internet |2(Ki|, whose detailed characterization is not possible, but
which can be investigated through the statistical analysis of its topological and functional properties. In
general, all infrastructures fit the framework of network theory very well, so that most of the real
networks studied are communication or transportation networks (such as the Internet, the Web, the
air-transportation network, power grids, telephone and road networks, etc.) [102]. A second class is
represented by networks related to social interactions [245], such as sexual-contact networks,
networks of acquaintances or collaboration networks. Finally, another large class concerns biological
networks |153Mll)2] (e.g. protein interaction networks, cellular networks, neural networks, food webs,
etc.).

The massive use of statistical techniques in the characterization of complex networks is, however,
closely related to the recent improvement of computers, which makes it easy to retrieve, collect and
handle large amounts of data. In the last decade, indeed, the analysis of the topological structure
of large real networks such as the Internet and the Web has shown that many real networks have
unexpected topological properties, characterized by heterogeneous connectivity patterns [120]. These
surprising results were in contrast with the common belief that real networks could be modeled using
either regular networks (e.g. grids or fully connected networks) or random graphs (i.e. networks
in which nodes are randomly connected in such a way that all of them have approximately the same
number of connections [77]). These models were studied for a long time without the need
for a detailed statistical analysis, since in such networks all nodes are approximately equivalent, and
the overall behavior of the network is well represented by that of a single node. The recent
discoveries, on the contrary, revealed a completely different scenario.
A large amount of data about various networks has been gradually collected, ranging from the social
sciences to biology, all presenting the same type of heterogeneity in the connectivity patterns.
The need for a more mathematical analysis of networks attracted a large number of physicists, who
recognized the possibility of applying the powerful methods of statistical physics. Without going into
a detailed description of the statistical framework that physicists have built by introducing statistical
physics methods into the ideas inherited from graph theory (see Chapter 2 for an introduction), the
main achievement in the characterization of network topology is the identification of a few universal
features that are common to many networks and make it possible to divide them into different classes.
A first relevant property concerns the degree of a node (i.e. the number of connections to other nodes).
In real networks, the probability of finding a node with a given degree (i.e. the degree distribution)
deviates significantly from the peaked distributions expected for random graphs and, in many cases,
exhibits a broad, skewed shape, with power-law tails with an exponent between 2 and 3. In this
range of values for the exponent, the second moment of the distribution diverges, meaning that
very large fluctuations can be found in the connectivity of the nodes (scale-free property).
Moreover, real networks are characterized by relatively short paths between any two nodes (small-world
property [177, 247]), a very important property in determining network behavior at both the structural
and functional levels. The small-world property, while intriguing, was already present in random
graph models, in which the average intervertex distance scales as the logarithm of the number of
nodes. The novelty lies in the fact that real networks present this property together with
a high density of triangles and other small cycles or motifs, which are completely absent in traditional
random graphs, whose local structure is tree-like.

These unexpected results have initiated a revival of network modeling, resulting in the introduction
and study of new classes of modeling paradigms |1(J21 11911 121)31 0]. Much effort has been spent
on conceiving models that are able to reproduce and predict the statistical properties of real networks,
but researchers soon realized that the characterization of real networks is not exhausted by their
topological properties and that in real networks topology and dynamics are intrinsically related.

1.2 Relation between Topology and Dynamics: a question of
timescales

The dynamical phenomena related to complex networks can be grouped into three different categories:
the dynamical evolution of networks, the dynamics on networks, and the dynamical interplay
between the network topology and the processes evolving on it.

The topology of real networks is indeed far from fixed: the number of nodes and links changes,
together with the local and global properties of the system. In particular, evolutionary principles are
often necessary ingredients for explaining some peculiar topological properties of networks (e.g. the
preferential attachment principle is necessary to understand the emergence of degree heterogeneity in
networks such as the Internet or the Web).

On the other hand, networks are structures on which dynamical processes take place, so it is
interesting to study the behavior of dynamical-system models evolving on networks. Many of them, such
as routing algorithms, oscillators, epidemic spreading, or searching processes, have direct applications
in the study of the dynamical phenomena observed on real networks; others, such as random walks,
statistical mechanics models, opinion formation, percolation, and strategic games, provide more
general information that can be used to build a common theoretical framework in which the
different properties of dynamical processes on networks can be analyzed and explained.
The third situation, characterized by the interplay of the dynamics "of networks" with the dynamics
"on networks", is more complicated, and this kind of problem has only recently been considered by the
complex networks community. Moreover, this case has usually been neglected because of the different
temporal scales of the two types of dynamics. We can indeed assume that the structural properties of a
network evolve with a time-scale $\tau_t$, while a particular dynamical process taking place on the network
evolves with a time-scale $\tau_d$; the situation mentioned above corresponds to the case $\tau_t \sim \tau_d$. When
$\tau_t \gg \tau_d$, we can study the evolution of dynamical processes on networks with quenched topological
structure, while the case $\tau_d \gg \tau_t$ means that the temporal evolution of the processes on the network is
neglected compared to that of the network structure itself. The latter case holds not only when these
dynamical processes are slow compared to the changes in the topology, but also when
they are fast but do not influence the structure of the network.
While the evolution of network topology has been widely investigated in the past, the present thesis
is devoted to the study of some aspects of dynamical processes on networks, i.e. the case $\tau_t \gg \tau_d$.
Actually, the scenario is much more complicated, since real networks are usually characterized by a
large number of dynamical processes evolving at the same time, so that in addition to the
topological and dynamical temporal scales mentioned above, we have to distinguish the temporal scale
governing the evolution of a single process, $\tau_d^s$, from that governing the evolution of the overall
average properties of that class of processes, $\tau_d^o$. This is simple to understand if we think of the
functioning of the Internet. Billions of data packets are continuously transferred between the routers,
each one performing a sort of (random) walk on the network from a source to a destination. Looking
at the global average properties of the traffic, however, we observe sufficiently stable quantities, so
that we can encode the average traffic between two neighboring routers (in terms of transferred bytes)
in a single value, a weight, with which we can label the corresponding link.

Therefore, a first way to take the dynamics into account is to endow the links with weights
representing the flow of information or the traffic among the constituent units of the system. More
generally, a weighted network representation makes it possible to take into account the functional
properties of networks.
On the other hand, it is also important to focus on the dynamical behavior of single dynamical
processes, such as the spreading of information or viruses on social and infrastructure networks, or the
processes of network exploration from a given source node. One of the striking results of this
separation of scales is that it is also possible to study single dynamical processes, such as spreading
and percolation, in a weighted network with quenched structural and functional properties
(i.e. $\tau_d^s \ll \tau_t$ and $\tau_d^s \ll \tau_d^o$).

As a final remark, we note that the general motivation with which we have studied dynamical
phenomena on networks is also twofold. On the one hand, we wanted to study the effects of
inhomogeneous topological and functional properties on the behavior of some classes of dynamical models;
on the other hand, we have exploited some of these dynamical phenomena in order to investigate
unknown topological properties of real networks. This twofold role of dynamical processes is
perhaps the most important idea that statistical physicists should learn from the new interdisciplinary
field of complex networks. For many years physicists have been used to studying a variety of interactions
on very well-defined topologies; now we have to face a more complex scenario, in which the roles of
topology and dynamical rules may even be inverted, i.e. well-defined dynamical phenomena can be
used to uncover topological properties of the system.

1.3 Summary of the thesis 

The work developed in this thesis concerns the study of various aspects of dynamical processes on
networks: each chapter is devoted to a particular issue, but apparently different problems are related
by the general scenario outlined in the previous section.

Chapter 2 provides an introduction to the science of complex networks: in the first part we
recall the main statistical measures used to analyze networks; then we give some examples of real
complex networks, focusing on the Internet and the World-wide Air-transportation Network; the final
section reviews the most important theoretical models of complex networks. This is not
an exhaustive introduction, but it is conceived to give the most relevant notions that are used or
mentioned in the rest of the work.

Chapter 3 concerns the theoretical characterization of the processes of exploration of complex
networks. The relevance of this topic lies in the fact that the topology of real networks is often
only partially known, and the methods used to acquire information on such topological properties
may present biases affecting the reliability of the phenomenological observations. We consider several
different types of network sampling methods, discussing their advantages and limits with respect to
their natural fields of application. We focus in particular on a tree-like exploration method used in
real mapping processes of the Internet and referred to as traceroute exploration. In order to verify the
reliability of the experimental data, and consequently of the main properties that have been derived
from their analysis, such as the existence of a broad degree distribution, we propose a theoretical
model of traceroute exploration of networks. This model allows an analytical study by means of a
mean-field approximation, providing a deeper understanding of the relation between the topological
properties of the original network and those of the sampled network. Moreover, massive numerical
simulations on computer-generated networks with various topologies provide a clearer
quantitative description of the mapping processes. The general picture acquired from this study is
finally exploited to introduce a statistical technique by means of which some of the biased quantities
can be suitably corrected.

This topic is also a clear example of the possible use of dynamical processes for the characterization
of unknown topological properties of real networks.

In Chapter 4 we take into account the weighted network representation and its relation with the
functional properties of the network. In the first part of the chapter, we investigate the role of weights
in determining the functional robustness of the system, and we compare the results with those based
on purely topological measures, using the airport network as a case study. The main idea is
to measure the vulnerability of the network using global observables based on both topological
and traffic centrality measures: we remove the most central nodes according to different centrality
measures, monitoring the effects on the structural and functional integrity of the system.
This study gives an example of the different roles played by topology and weights, at the same time
pointing out the validity of a static representation of network functionality (i.e. encoding average
flows and traffic into weights on the edges). The second part of the chapter is instead devoted to the
study of weighted networks from a purely dynamical point of view. Exploiting some remarkable
properties of percolation theory, we build a general theoretical framework in which spreading
processes on weighted networks can be analyzed.

In analogy with the scenario proposed in the previous section, in passing from the first to the
second part of the chapter we move from a situation in which we are interested only in the structural
and functional properties determined by the average dynamical behavior of the system to the study
of the effects of such (structural and functional) properties on the evolution of a particular dynamical
process on the network.

Chapter 5 is completely devoted to the analysis of the recently proposed Naming Game model,
which was conceived as a model for the emergence of a communication system or a shared vocabulary
in a population of agents. The rules governing the pairwise interactions between the individuals are
simple but present several new features, such as negotiation, feedback and memory, that are typical
of human social dynamics. For this reason the model can also be usefully applied in different
contexts, such as problems of opinion formation.

The dynamical evolution of the model is studied considering populations with different topologies,
from regular lattices to complex networks, showing that the dynamical phenomena generated by the
model depend strongly on the topological properties of the system. In the last section of the chapter,
the attention is focused on the activity patterns of single agents, which display rather unexpected
properties due to the non-trivial relation between memory and degree heterogeneity.

General conclusions on the work done and possible future developments of the ideas presented in
the thesis are reported in Chapter 6.

Chapter 2

Structure of complex networks: an 
Overview 

2.1 Introduction 

The first step toward a complete characterization of complex networks consists in a reliable description
of their topological properties. As we will see in the following chapters, topological quantities play a
relevant role in determining the functionality of real networks as well as the dynamical patterns of
processes taking place on them. Consequently, we devote Section 2.2 to introducing a set of mathematical
tools, some of them borrowed from Graph Theory, that will be useful in the statistical investigation
of complex networks. In Section 2.3 several examples of real complex networks are reported, together
with the analysis of their most important topological properties. Special attention is paid to the Internet
and the World-wide Air-transportation Network, whose topological and dynamical properties will be
further investigated in Chapters 3 and 4. Finally, in Section 2.4 we present a brief overview of the main
models of complex networks that are commonly used to reproduce the topological and dynamical
properties observed in real networks. The present chapter is not meant to be an exhaustive review
of all recent developments in the Science of Complex Networks, for which we refer to some very good
books [203, 102] and review articles |ll)3l lH ITM]. Similarly, for a simple introduction to Graph Theory
we refer to Ref. [77], while a more rigorous approach is provided by the book of Bollobás 0S|.
Our purpose is rather to provide a brief description of the measures used in network
analysis, focusing only on those concepts that are useful for a better comprehension of the work
developed in the thesis.

2.2 Statistical Measures of Networks Topology

Graph theory is a fundamental field of mathematics whose modern formulation can be ascribed to P.
Erdős and A. Rényi, for a series of papers that appeared in the early 1960s in which they laid the
groundwork for the study of random graphs [1151 II lfc>|. In the following, we go through the basic
notions of graph theory, enriching them with the definition of other, more recently introduced
quantities that are commonly used for the statistical characterization of network structure.

2.2.1 Basic notions of Graph Theory 

An undirected graph $G$ is a mathematical structure defined as the pair $G = (\mathcal{V}, \mathcal{E})$, in which
$\mathcal{V}$ is a non-empty set of elements, called vertices, and $\mathcal{E}$ is the set of edges, i.e. unordered
pairs of vertices. More generally, any system whose elementary units are connected in pairs can be
represented as a graph. In the interdisciplinary context, the nomenclature is not uniform. Vertices are
usually called nodes by computer scientists, sites by physicists and actors by sociologists. Edges are
also referred to as links, bonds, or ties. We will use these terms interchangeably, without any reference
to a particular field of research. The cardinalities of the sets $\mathcal{V}$ and $\mathcal{E}$ are denoted by $N$ and
$E$, respectively. The number of vertices is also referred to as the size of the graph. The simplest
generalization of the definition of graph is that of directed graph, obtained by considering oriented
edges (arcs), i.e. ordered pairs of vertices. A graphical representation of a graph consists in drawing a
dot for every vertex, and a line between two vertices if they are connected by an edge (see Fig. 2.1).
If the graph is directed, the direction is indicated by drawing an arrow.

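A convenient mathematical notation to define a graph is the adjacency matrix $A = \{a_{ij}\}$, an $N \times N$
matrix such that

$$a_{ij} = \begin{cases} 1 & \text{if } (i,j) \in \mathcal{E} \\ 0 & \text{otherwise.} \end{cases} \qquad (2.1)$$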

The adjacency matrix of undirected graphs is symmetric. Two vertices joined by an edge are called
adjacent or neighbors; the neighborhood of a node $i$ is the set $\mathcal{V}(i)$ of all neighbors of the node $i$.
The number of neighbors of a node $i$ is called the degree $k_i = \sum_j a_{ij}$ of $i$. In the case of directed
edges, we have to distinguish between incoming and outgoing edges, thus we define an in-degree
($k_i^{in} = \sum_j a_{ji}$) and an out-degree ($k_i^{out} = \sum_j a_{ij}$). We do not go deeper into the definition of
properties for directed graphs, since in this thesis only undirected networks will be explicitly studied.
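
The definitions above can be sketched in a few lines of code. This is only a minimal illustration, assuming a small hypothetical 4-vertex undirected graph; the edge list and the helper names `adjacency_matrix` and `degrees` are not part of the thesis.

```python
# Hypothetical 4-vertex undirected graph, chosen only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
N = 4


def adjacency_matrix(n, edge_list):
    """Build the symmetric N x N adjacency matrix of an undirected graph."""
    a = [[0] * n for _ in range(n)]
    for i, j in edge_list:
        a[i][j] = 1
        a[j][i] = 1  # undirected graph: a_ij = a_ji
    return a


def degrees(a):
    """Degree k_i = sum_j a_ij, one entry per vertex."""
    return [sum(row) for row in a]


A = adjacency_matrix(N, edges)
k = degrees(A)
E = sum(k) // 2  # every edge is counted twice in the sum of the degrees
```

For a directed graph the same matrix would no longer be symmetric, and row sums and column sums would give the out-degree and the in-degree respectively.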

Moreover, we consider graphs in which vertices do not present self-links (i.e. edges from a vertex
to itself) or multi-links (i.e. more than one edge connecting two vertices). Graphs containing such
objects, which are rather unusual in real networks, are known in graph theory as multigraphs. If we
exclude self-links and multiple links, the maximum possible number of edges is $N(N-1)/2$. Graphs
whose number of edges is close to this value are called dense graphs, while graphs in
which the number of edges is bounded by a linear function of $N$ are sparse graphs.
A generalization of the notion of graph, which will be repeatedly taken into account in the following
chapters, is that of weighted graph. In weighted (directed or undirected) graphs, each edge $(i,j)$
carries a weight, i.e. a variable $w_{ij}$ assuming real (or integer) values. The nodes
can also be differentiated, introducing classes of nodes with the same set of internal variables. Graphs
with distinct classes of nodes will be denoted as multi-type graphs. Graphs in which there are two or
more distinct sets of nodes with no edges connecting vertices in the same set are commonly referred
to as multipartite graphs. Real networks are actually weighted and multi-type, though in many situations
it is more convenient to study their properties by means of single-type, unipartite and/or unweighted
representations.

Figure 2.1: Basic elements of graph theory and their typical graphic representations: (A) vertices and
edges in an undirected graph, (B) directed edges indicated by arrows in a directed graph, (C) the
degree of a node, and (D) weighted edges.

2.2.2 Degree distribution 

A natural way to group nodes in classes is to consider nodes with the same degree $k$. This
is a convenient strategy to analyze large graphs, since the connectivity properties of the nodes are
statistically represented by the histogram $P(k) = N_k/N$, in which $N_k$ is the number of vertices
of degree equal to $k$. In the infinite size limit ($N \to \infty$), $P(k)$ is called degree distribution, since it
represents the probability that a node has degree $k$. The degree distribution $P(k)$ satisfies
the normalization condition $\sum_{k=0}^{\infty} P(k) = 1$. The average degree of an undirected graph is defined as
the average value of $k$ over all the vertices in the graph,

$$\langle k \rangle = \frac{1}{N} \sum_i k_i = \sum_k k P(k) = \frac{2E}{N} . \qquad (2.2)$$

The condition of sparseness for a graph can be translated into $\langle k \rangle \sim O(1)$.
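
As a minimal sketch of these definitions, the histogram $P(k)$ and the average degree can be computed directly from a list of degrees. The degree list below is hypothetical (it corresponds to a 4-vertex graph with 4 edges), and the function names are illustrative only.

```python
from collections import Counter


def degree_distribution(degree_list):
    """Return the histogram P(k) = N_k / N as a dict {k: P(k)}."""
    n = len(degree_list)
    counts = Counter(degree_list)  # N_k for every observed degree k
    return {k: counts[k] / n for k in sorted(counts)}


def average_degree(p_k):
    """Average degree <k> = sum_k k P(k)."""
    return sum(k * p for k, p in p_k.items())


# Hypothetical degree sequence of a graph with N = 4 vertices and E = 4 edges.
degree_list = [2, 2, 3, 1]
P = degree_distribution(degree_list)
mean_k = average_degree(P)
```

In this example the value of `mean_k` coincides with $2E/N$, as expected for an undirected graph in which each edge contributes to the degree of two vertices.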

However, in order to study topological properties of networks, the knowledge of higher moments of the
degree distribution is also important. For instance, the second moment $\langle k^2 \rangle$ measures the fluctuations
of the degree distribution and governs the percolation properties jSU], while higher moments determine
the conditions for the mean-field behavior of the Ising model on general networks [104].
For a long time, Graph Theory has been interested in random graphs with homogeneous connectivity,
i.e. with a degree distribution that is sharply peaked around a characteristic average degree and decays
exponentially fast for $k \gg \langle k \rangle$. On the contrary, recent phenomenological findings have shown that a
large number of real networks present heavy-tailed distributions, some of them close to a power-law
behavior. In these networks, there is a non-negligible probability of finding "hubs", i.e. nodes of
degree $k \gg \langle k \rangle$.
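
The role of hubs in the moments of the distribution can be made explicit with a short worked estimate. It assumes a pure power-law form $P(k) \simeq C k^{-\gamma}$ for $k \geq k_{min}$, a continuous-degree approximation, and an upper cutoff $k_{max}$; the symbols $C$, $k_{min}$ and $k_{max}$ are introduced here only for illustration.

```latex
\langle k^n \rangle \simeq C \int_{k_{min}}^{k_{max}} k^{\,n-\gamma}\, dk
  = \frac{C}{n+1-\gamma}
    \left( k_{max}^{\,n+1-\gamma} - k_{min}^{\,n+1-\gamma} \right),
  \qquad \gamma \neq n+1 .
```

For $n = 1$ the result stays finite as $k_{max} \to \infty$ provided $\gamma > 2$, whereas for $n = 2$ it grows as $k_{max}^{3-\gamma}$ when $\gamma < 3$ (and logarithmically at $\gamma = 3$). Under these assumptions, a power-law exponent between 2 and 3 therefore gives a finite average degree together with unbounded degree fluctuations in the infinite size limit.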

2.2.3 Two and three points degree correlations 

The degree distribution does not exhaust the topological characterization of a network, since it has
been shown that many real networks present degree correlations between nodes, i.e. the probability
that a node of degree $k$ is connected to another node of degree $k'$ depends on $k$ and $k'$ themselves. More
rigorously, we can introduce the conditional probability $P(k'|k)$ that a vertex of degree $k$ is connected to
a vertex of degree $k'$. This quantity satisfies the normalization $\sum_{k'} P(k'|k) = 1$ and a detailed balance
condition 0H|

$$k P(k'|k) P(k) = k' P(k|k') P(k') , \qquad (2.3)$$

corresponding to the absence of dangling bonds. In uncorrelated graphs, $P(k'|k)$ does not depend on
$k$ and it can be easily obtained from the normalization condition and Eq. (2.3):

$$P(k'|k) = \frac{k' P(k')}{\langle k \rangle} . \qquad (2.4)$$
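
The step leading to the uncorrelated form can be spelled out as follows. This is a sketch that uses only the normalization and detailed balance conditions stated above; the symbol $f(k')$, denoting the $k$-independent conditional probability, is introduced here for convenience.

```latex
\begin{aligned}
  & P(k'|k) = f(k') , \qquad \sum_{k} f(k) = 1
      && \text{(no dependence on } k \text{, normalization)} \\
  & k\, f(k')\, P(k) = k'\, f(k)\, P(k')
      && \text{(detailed balance)} \\
  & f(k') \sum_{k} k\, P(k) = k'\, P(k') \sum_{k} f(k)
      && \text{(sum over } k \text{)} \\
  & f(k')\, \langle k \rangle = k'\, P(k')
      \;\;\Longrightarrow\;\;
      P(k'|k) = \frac{k'\, P(k')}{\langle k \rangle} .
\end{aligned}
```

The factor $k'$ expresses the fact that an edge chosen at random is more likely to end on a vertex of large degree, simply because such a vertex has more edges attached to it.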

Similarly, it is possible to define a three-point correlation function $P(k', k''|k)$, i.e. the probability
that a vertex of degree $k$ is simultaneously connected to vertices of degree $k'$ and $k''$.
In general, the direct measurement of these two conditional probabilities is quite cumbersome and
gives very noisy results on any kind of network. For this reason one usually prefers more practical
estimates by means of indirect quantities that are averaged over the neighborhood of a node.
In a given network with adjacency matrix $\{a_{ij}\}$, a good estimate of the degree correlations of a
vertex $i$ is provided by the average degree of the nearest neighbors of $i$,

$$k_{nn,i} = \frac{1}{k_i} \sum_{j=1}^{N} a_{ij} k_j . \qquad (2.5)$$

In terms of degree classes, the average degree of the nearest neighbors of a
vertex of degree $k$ is

$$k_{nn}(k) = \sum_{k'} k' P(k'|k) . \qquad (2.6)$$

If the network is uncorrelated, the degree of the neighbors can assume any possible value, and the
average turns out to be approximately independent of $k$, i.e. $k_{nn}(k) \simeq$ const. Correlated networks,
on the contrary, can be schematically divided into two large classes. The first class is that of networks
presenting assortative mixing, i.e. nodes of high (small) degree are more likely to be connected
with nodes of high (small) degree ($k_{nn}(k)$ grows with $k$). This seems to be a general property of
social networks. When vertices of high degree are preferentially linked with vertices of smaller degree
(and vice versa), i.e. $k_{nn}(k)$ is a decreasing function of $k$, the network has disassortative mixing.
Many critical infrastructures such as transportation and communication networks present a clearly
disassortative behavior.
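
The average nearest-neighbor degree can be estimated numerically as in the sketch below, which averages the per-vertex quantity over all vertices of the same degree. The star graph used as input is a hypothetical example of an extremely disassortative structure, and all function names are illustrative assumptions.

```python
from collections import defaultdict


def neighbors_from_edges(n, edge_list):
    """Neighbor sets of an undirected graph with vertices 0..n-1."""
    nbrs = [set() for _ in range(n)]
    for i, j in edge_list:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs


def knn_per_node(nbrs):
    """Average degree of the nearest neighbors of each vertex."""
    k = [len(s) for s in nbrs]
    return [sum(k[j] for j in s) / k[i] if k[i] > 0 else 0.0
            for i, s in enumerate(nbrs)]


def knn_spectrum(nbrs):
    """Average of the per-vertex values over vertices of equal degree k."""
    knn_i = knn_per_node(nbrs)
    by_degree = defaultdict(list)
    for i, s in enumerate(nbrs):
        if s:  # isolated vertices have no neighbors and are skipped
            by_degree[len(s)].append(knn_i[i])
    return {k: sum(v) / len(v) for k, v in sorted(by_degree.items())}


# Hypothetical star graph: one hub (vertex 0) connected to four leaves.
star = neighbors_from_edges(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
spectrum = knn_spectrum(star)
```

In the star graph the leaves (degree 1) see only the hub, while the hub (degree 4) sees only leaves, so the estimated spectrum decreases with the degree, which is the signature of disassortative mixing described above.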

Analogously, for three-point correlations, we can define a quantity called clustering coefficient that
measures the tendency of a graph to form cliques in the neighborhood of a given node. As depicted in
Fig. 2.2A, the clustering coefficient of a node $i$ is defined as the ratio of the actual number of edges
$n_i$ between the neighbors of $i$ to the maximum possible number of such edges, $k_i(k_i-1)/2$, i.e.

$$c(i) = \frac{2 n_i}{k_i (k_i - 1)} = \frac{\sum_{j,h} a_{ij} a_{jh} a_{hi}}{k_i (k_i - 1)} . \qquad (2.7)$$

Figure 2.2: (A) The clustering coefficient gives a measure of the local cohesiveness in the neighborhood
of a vertex: the central node in the example has clustering coefficient c = 1 if all pairs of neighbors
of the vertex are connected, c = 0.5 if only half of the possible pairs are connected, and c = 0 if no
triangles are formed. (B) The dashed path represents the shortest path (of length $\ell_{ij} = 4$) between
nodes i and j.

The study of the clustering spectrum $c(k)$,

$$c(k) = \frac{1}{N_k} \sum_{i \,|\, k_i = k} c(i) , \qquad (2.8)$$

provides interesting insights into the local cohesiveness of the network. In particular, a clustering
coefficient decreasing with the degree $k$ has been related to the existence of hierarchical
structures (e.g. in biological networks [210]).
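
A minimal sketch of how the clustering coefficient of a node and the clustering spectrum can be computed from neighbor sets is given below. The small graph is hypothetical, the function names are illustrative, and assigning a clustering coefficient of zero to vertices with fewer than two neighbors is a convention assumed here rather than taken from the text.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(nbrs, i):
    """Ratio between existing and possible edges among the neighbors of i."""
    k_i = len(nbrs[i])
    if k_i < 2:
        return 0.0  # assumed convention: no pair of neighbors, no triangles
    # n_i: number of edges connecting two neighbors of i
    n_i = sum(1 for u, v in combinations(sorted(nbrs[i]), 2) if v in nbrs[u])
    return 2.0 * n_i / (k_i * (k_i - 1))


def clustering_spectrum(nbrs):
    """Average clustering coefficient of the vertices of degree k."""
    by_degree = defaultdict(list)
    for i in range(len(nbrs)):
        by_degree[len(nbrs[i])].append(local_clustering(nbrs, i))
    return {k: sum(v) / len(v) for k, v in sorted(by_degree.items())}


# Hypothetical graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
nbrs = [{1, 2}, {0, 2}, {0, 1, 3}, {2}]
c = [local_clustering(nbrs, i) for i in range(len(nbrs))]
c_k = clustering_spectrum(nbrs)
```

Vertex 2 has three neighbors but only one of the three possible pairs is connected, so its clustering coefficient is 1/3, while vertices 0 and 1 belong to a fully connected neighborhood.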

2.2.4 Shortest path length and distance 

Many non-local properties of graphs are related to the reachability of a vertex starting from another.
A walk from a vertex $i$ to a vertex $j$ consists of a sequence of edges and vertices joining $i$ with $j$. The
length $l$ of the walk coincides with the number of edges in the sequence. A path is a walk in which no
node is visited more than once. A graph is connected if, for any pair of vertices $i$ and $j$, there is a path
from $i$ to $j$. The number of walks of length $l$ between two nodes $i$ and $j$ can be expressed by the $l$-th
power of the adjacency matrix,

$$(A^l)_{ij} = \sum_{i_1, i_2, \dots, i_{l-1}} a_{i i_1} a_{i_1 i_2} \cdots a_{i_{l-1} j} . \qquad (2.9)$$

In particular, this definition is related to the behavior of a random walker on the graph. A closed
walk, in which the initial and final vertices coincide, is called a cycle; a $k$-cycle is a cycle of length $k$.
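
The relation between walks and powers of the adjacency matrix can be checked on a small example. The sketch below uses plain nested lists and a hypothetical triangle graph; the helper names `mat_mul` and `mat_power` are illustrative.

```python
def mat_mul(x, y):
    """Product of two square matrices stored as nested lists."""
    n = len(x)
    return [[sum(x[i][m] * y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(a, l):
    """l-th power of the matrix a (l >= 0), by repeated multiplication."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(l):
        result = mat_mul(result, a)
    return result


# Hypothetical example: adjacency matrix of a triangle (3 vertices, 3 edges).
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]

walks2 = mat_power(A, 2)  # entry (i, j): number of walks of length 2
walks3 = mat_power(A, 3)  # entry (i, j): number of walks of length 3
```

The diagonal of `walks2` reproduces the degrees (a walk of length 2 from a vertex to itself goes to a neighbor and back), and the diagonal of `walks3` counts the closed walks of length 3, i.e. the triangle traversed in its two possible orientations.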
The walk of minimum length between two nodes is called shortest path, and its length corresponds to
the hop distance $\ell_{ij}$ between the nodes $i$ and $j$ (see Fig. 2.2B). The diameter $\ell_{max}$ is the maximum
distance between pairs of nodes in the graph, while the average distance $\langle \ell \rangle$ between nodes is given by

$$\langle \ell \rangle = \frac{2}{N(N-1)} \sum_{i<j} \ell_{ij} . \qquad (2.10)$$

A complete characterization of the metric properties of a graph requires knowledge of the full
probability distribution $P(\ell)$ of finding two vertices separated by a distance $\ell$. In fact, many real networks
present a symmetric distribution peaked around the average value $\langle \ell \rangle$, which can be safely considered
representative of the typical distance between nodes in the network.
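
The metric quantities discussed above can be obtained with a breadth-first search from every vertex. The following is a minimal sketch assuming an unweighted, undirected and connected graph given as neighbor lists; the path graph used as input and the function names are hypothetical.

```python
from collections import Counter, deque


def bfs_distances(nbrs, source):
    """Hop distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_statistics(nbrs):
    """Diameter, average distance and distance distribution over all pairs.

    Assumes a connected graph: an unreachable pair raises a KeyError.
    """
    n = len(nbrs)
    lengths = []
    for i in range(n):
        dist = bfs_distances(nbrs, i)
        for j in range(i + 1, n):  # each unordered pair is counted once
            lengths.append(dist[j])
    counts = Counter(lengths)
    distribution = {l: counts[l] / len(lengths) for l in sorted(counts)}
    return max(lengths), sum(lengths) / len(lengths), distribution


# Hypothetical path graph 0 - 1 - 2 - 3.
path = [[1], [0, 2], [1, 3], [2]]
diameter, average, distribution = distance_statistics(path)
```

Running one search per vertex costs a time proportional to $N(N+E)$ overall, which is affordable for sparse graphs of moderate size; for very large networks the same statistics are often estimated from a sample of source vertices.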

From this point of view, complex networks seem to share a striking property, called the small-world effect
[2471 124fi|, meaning that the average intervertex distance $\langle \ell \rangle$ is very small compared to the size $N$
of the network, scaling logarithmically or slower with it. While this property can also be found in
generic random graphs, where $\langle \ell \rangle \propto \log N$, the result is in contrast with the behavior of the distance
on regular $d$-dimensional lattices, in which $\langle \ell \rangle \propto N^{1/d}$.

The practical implication of the small-world property is that it is possible to go from a vertex to any
other in the network passing through a very small number of intermediate vertices. In this regard,
the concept of small world was first popularized by the sociologist S. Milgram in 1967 by means of a
famous experiment [177], in which he showed that a small number of acquaintances, on average only six,
is actually sufficient to connect (by letter) any two individuals in the United States. The experiment
was recently reproduced using the world-wide e-mail network and provided results consistent with the
small-world hypothesis [100].

Note that the presence of the small-world property is relevant not only at a topological level: it
also has strong effects on all dynamical processes taking place on the network.

A plethora of different statistical measures is based on the notions of distance and shortest path. Some
of them are used in the topological characterization of networks, others in the study of the relation
between functional properties and dynamics (see Chapter^); we concentrate our attention on centrality
measures, which will be extensively used in the following chapters.

The metric properties of a graph are well suited to defining several different measures
of centrality, which are used in social sciences to estimate the importance of nodes and edges. The
most local of these measures is the degree centrality, which is proportional to the degree of a node and
does not account for any metric feature of the graph. All other centrality measures involve non-local
properties in the form of the intervertex distance. For instance, the closeness centrality of a node i is
defined as the inverse of the sum of the distances of all nodes from i.
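
As an illustration of the closeness centrality, here is a minimal sketch assuming a connected, unweighted graph given as an adjacency list; the function name `closeness` and the small star graph are hypothetical examples.

```python
from collections import deque

def closeness(adj, i):
    """Closeness of node i: inverse of the sum of hop distances from i to all other nodes."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # The graph is assumed connected with at least two nodes, so the sum is positive.
    return 1.0 / sum(dist.values())

# Star with center 0 and three leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
center = closeness(star, 0)   # 1 / (1 + 1 + 1)
leaf = closeness(star, 1)     # 1 / (1 + 2 + 2)
```

The center is closer to all other nodes than any leaf, hence its larger closeness.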

The most famous measure of centrality is the betweenness centrality, defined in Ref. [123] and recently
adopted in network science as the basic definition of centrality of nodes and edges. The node betweenness
centrality $b_i$ measures the relative number of shortest paths between pairs of vertices that
pass through the vertex i, i.e.

$$ b_i = \sum_{s \neq t \neq i} \frac{\sigma_{st}(i)}{\sigma_{st}} , \tag{2.11} $$



where $\sigma_{st} = \sigma_{ts}$ is the number of shortest paths between the vertices s and t, and $\sigma_{st}(i) = \sigma_{ts}(i)$ is the
number of them going through the node i. Similarly, the edge betweenness centrality $b_{ij}$ is defined as
the fraction of all shortest paths from any pair of vertices in the network that pass through the edge
(i,j),

$$ b_{ij} = \sum_{s \neq t} \frac{\sigma_{st}(i,j)}{\sigma_{st}} , \tag{2.12} $$

in which $\sigma_{st}(i,j)$ is the number of shortest paths going through the edge (i,j). It is worth
remarking that in the literature there are several slightly different definitions of betweenness centrality:
in particular, a prefactor 1/2 can be included in order not to count each path twice, while the paths
having the node of interest (i.e. i and/or j) as initial or final point can be included or discarded
(the two cases differ just by a constant contribution). The computational cost of determining the
betweenness centrality for all vertices (or edges) in a graph is very high, since one has to discover
all existing shortest paths between pairs of vertices. An optimized algorithm, proposed by Brandes
[53], reduces the computational complexity from $O(N^2 E)$ to $O(NE)$. For sparse graphs the
algorithm runs in $O(N^2)$ steps, which is still a high complexity when the size of the network is
very large (e.g. $N \sim O(10^6)$), or when the computation has to be repeated many times, as for the
measures exposed in Chapter 0]

Figure 2.3: (A) Star-like network, the node v (in red) has maximum betweenness because all shortest
paths have to pass through the central node. (B) A fully connected clique is characterized by a zero
betweenness value for all nodes. (C) The node v (in red) connects two highly connected groups of
nodes: its betweenness is very high even if its degree is very low.

The meaning of betweenness centrality is easily illustrated by means of a few
examples dealing with extreme topological conditions. Let us consider a star network with a unique
central vertex v and N-1 leaves at distance 1 from the center (see Fig. 2.3-A). The node betweenness
centrality of v is simple to compute because v belongs to all shortest paths between pairs of leaf nodes;
each ordered pair of distinct leaves is joined by a single shortest path, therefore the sum in Eq. 2.11
becomes a sum of unit contributions and we get $b_v = (N-1)(N-2)$.
The opposite situation is the complete graph, in which all vertices, according to the definition in
Eq. 2.11, have zero betweenness centrality (Fig. 2.3-B). Another interesting case is that of a node
or an edge joining, as a bridge, two otherwise disconnected portions of a network (Fig. 2.3-C): all
paths connecting pairs of nodes belonging to different regions have to pass through that particular
node (edge), which turns out to have very high betweenness even if it may have very low degree. This
property shows that in many networks, as we will see, betweenness centrality is non-trivially correlated
with the other topological properties.
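
The three limiting cases can be reproduced with a short implementation of the accumulation scheme introduced by Brandes for unweighted graphs. This is a sketch under stated assumptions, not the code used for the results of this thesis: the sum runs over ordered pairs of distinct vertices, paths starting or ending at the node are discarded, and the three small graphs are hypothetical examples.

```python
from collections import deque

def betweenness(adj):
    """Node betweenness summed over ordered pairs (s, t), endpoints excluded."""
    b = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes to s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                b[w] += delta[w]
    return b

# (A) star with center 0 and N - 1 = 4 leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
# (B) complete graph on 4 nodes.
clique = {i: [j for j in range(4) if j != i] for i in range(4)}
# (C) two triangles {0, 1, 2} and {4, 5, 6} joined through node 3.
bridge = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4],
          4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}

b_star = betweenness(star)
b_clique = betweenness(clique)
b_bridge = betweenness(bridge)
```

In case (C) the bridging node has degree 2 only, yet it carries the shortest paths of all 18 ordered pairs of nodes lying on opposite sides.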

2.2.5 Subgraph structures 

In this paragraph, we discuss a series of topological properties dealing with the structure of subsets of
a graph. Firstly, a graph $G' = (\mathcal{V}', \mathcal{E}')$ is a subgraph of the graph $G = (\mathcal{V}, \mathcal{E})$ if $\mathcal{V}' \subseteq \mathcal{V}$ and $\mathcal{E}' \subseteq \mathcal{E}$. A
maximal subgraph with respect to a given property is a subset of the graph that cannot be extended
without losing that property. Given a subset of nodes $\mathcal{V}' \subseteq \mathcal{V}$, we call $G' = (\mathcal{V}', \mathcal{E}|_{\mathcal{V}'}) \subseteq G$ the
subgraph of G induced by $\mathcal{V}'$. A component of a graph G is a maximal connected subgraph of G; it
is called giant component if its size is O(N).

We have already seen that the clustering coefficient is a measure of the cohesiveness of a graph;
however, the maximal cohesiveness corresponds to sets of nodes with all-to-all connections, called
cliques. Formally, a clique is a maximal complete subgraph of three or more nodes. Though there
are several other quantities involving the subgraph's definition, such as the n-cliques or the k-plexes,
for the purposes of this work we are only interested in two of them: k-cores and k-shells.
The k-core of a graph G is the maximal induced subgraph of G whose vertices have
degree at least k [47, 221]. (Note that this means that they must have degree at least k inside
the subgraph.) Such a subgraph can be obtained by recursively removing all the vertices of degree
lower than k, using a procedure called k-core decomposition.

Let us call $N_k$ the number of nodes in the graph with degree not larger than k and $C_k$ the set of nodes
belonging to the k-core; the algorithm reads

(1) Set k = 0 ($C_k$ is empty $\forall k > 0$, $C_0 = G$);

(2) $k \to k + 1$;

(3) Prune all nodes with degree lower than k (and the corresponding edges);

(4) Update $N_l$, $\forall l > 0$, according to the pruned network;

(5) Repeat point (3) until $N_{l<k} = 0$;

(6) Put all remaining nodes (and edges) in $C_k$ and go back to point (2).

A node has shell index k if it belongs to the k-core but not to the (k+1)-core; the k-shell is the set of
all vertices of shell index k, i.e. the difference between two consecutively nested cores. The algorithm
is very clear if we look at simple cases similar to that sketched in Fig. 2.4, in which k-cores and k-shells
are highlighted using different colors.
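
The pruning procedure can be sketched in a few lines. The version below is an illustrative variant, not the original implementation: instead of storing the sets $C_k$, it records directly the shell index of every node at the moment it is pruned, so that the k-core is recovered as the set of nodes with shell index at least k. The function name and the example graph are hypothetical.

```python
def shell_indices(adj):
    """Return {node: shell index} by iterative pruning of low-degree nodes."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes with degree at most k in the pruned graph are not in the (k+1)-core.
        prune = [v for v in remaining if degree[v] <= k]
        if not prune:
            k += 1
            continue
        while prune:
            v = prune.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            shell[v] = k
            # Removing v lowers the degree of its surviving neighbors,
            # which may in turn fall below the threshold.
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        prune.append(u)
    return shell

# Triangle {0, 1, 2}, a pendant node 3 attached to 0, and an isolated node 4.
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0], 4: []}
shell = shell_indices(graph)
two_core = {v for v, index in shell.items() if index >= 2}
```

Here the pendant node belongs to the 1-shell, while the triangle forms the 2-core.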

Figure 2.4: Illustration of the k-core structure of a simple graph. The blue circle contains the 1-core,
i.e. the giant connected component of the graph. The two smaller contours draw the boundaries of
the 2-core (red) and 3-core (black). Blue nodes belong to the 1-shell, the red and black ones to the
2-shell and 3-shell respectively. The internal sets can be obtained from the larger ones by iterative
pruning of nodes as explained in the text.

Finally, we usually refer to communities when the graph can be divided into a certain number of
subgraphs such that, for each of them, the number of edges connecting the
subgraph with the rest of the graph is very small compared to the number of edges linking
vertices within the same subgraph. In such a case each subgraph is a community, as depicted in
Fig. 2.5. The definition of community is not rigorous, thus the community structure of a network
depends strongly on the practical method used to detect the subgraphs. Several algorithms have
been proposed in order to find the community structure of networks. Some of them iteratively reduce
the size of the different subgraphs, while others are based on the opposite principle of clustering
algorithms, but all of them suffer from the same inability to detect the correct level at which the
iterative procedure should be stopped. This is probably an intrinsic problem, due to the absence of a
rigorous definition able to fix the resolution at which the community structure is most visible.
Some of the algorithms used to detect communities are based on topological properties, such as the
betweenness [128]; others exploit the properties of some dynamical systems, e.g. the synchronizability of
oscillators [12]. In Chapter [3 we will show that non-equilibrium models of coarsening dynamics
can also be used to put forward alternative methods to detect communities.

2.2.6 Further metrics for weighted networks 

In many real networks, edges are not identical: they can have different intensities, which are related to
some physical properties and are taken into account by assigning them a weight. For instance, in the
Internet the edges represent physical connections (cables), thus weights could be introduced to account
for their bandwidth, or for the traffic between routers. In the air-transportation network, the weights are
proportional to the traffic on the airline connections. Hence, in technological and infrastructure
networks, weights usually correspond to some physical quantities (energy, information, goods, ...)
that are transferred between two nodes. On the other hand, in biological networks weights account
for the strength of the interactions between genes or proteins, whereas in social networks they specify
the intensity of interactions between the actors.
Figure 2.5: An example of a network with a very strong community structure. The number of edges
connecting nodes belonging to the same community (black full links) is much larger than the number
of edges between nodes in different communities (red dashed links).

Many statistical quantities that have been introduced for unweighted networks can be easily generalized
to weighted networks. The degree is generalized by introducing the node strength; the strength $s_i$ of
a node i is

$$ s_i = \sum_{j \in \mathcal{V}} w_{ij} , \tag{2.13} $$

where $w_{ij}$ is the weight on the edge (i,j) [2*3, 252] (see Fig. 2.6-A).

If the weights are distributed uniformly at random, the node strength turns out to be linearly correlated
with the degree, i.e. $s(k) \simeq \langle w \rangle k$. The actual degree-strength correlations observed in many
real networks suggest instead a super-linear relation $s(k) \simeq A k^{\beta}$, with $\beta > 1$ and $A \neq \langle w \rangle$ (see
Chapter HI).

A standard measure in the analysis of weighted graphs is the strength distribution $P_s(s)$, which
gives the probability that a randomly chosen node has strength equal to s. Many weighted
networks with broad degree distribution, such as the World-wide Air-transportation Network, also
present broad weight and strength distributions.
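
The strength of a node and the average strength s(k) of nodes of degree k are straightforward to compute. The sketch below assumes an undirected weighted graph stored as a dictionary of dictionaries with symmetric entries; the toy graph, its weights and the function names are purely illustrative.

```python
from collections import defaultdict

# Hypothetical weighted undirected graph: weights[i][j] = weights[j][i].
weights = {
    "a": {"b": 2.0, "c": 1.0, "d": 3.0},
    "b": {"a": 2.0, "c": 4.0},
    "c": {"a": 1.0, "b": 4.0},
    "d": {"a": 3.0},
}

def strength(weights, i):
    """Strength of node i: sum of the weights of the edges attached to i."""
    return sum(weights[i].values())

def strength_vs_degree(weights):
    """Average strength s(k) of the nodes having degree k."""
    by_degree = defaultdict(list)
    for i, nbrs in weights.items():
        by_degree[len(nbrs)].append(sum(nbrs.values()))
    return {k: sum(values) / len(values) for k, values in by_degree.items()}

s_a = strength(weights, "a")
s_of_k = strength_vs_degree(weights)
```

On real data, s(k) obtained in this way can be compared with the same quantity computed after reshuffling the weights among the edges.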

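Other quantities are readily extended in order to account for weights, in particular two- and
three-point correlations. For each vertex i one can define a weighted average nearest neighbors degree,

$$ k^w_{nn,i} = \frac{1}{s_i} \sum_{j \in \mathcal{V}} a_{ij} w_{ij} k_j , \tag{2.14} $$

and a weighted clustering coefficient

$$ c^w_i = \frac{1}{s_i (k_i - 1)} \sum_{j,m} \frac{w_{ij} + w_{im}}{2} a_{ij} a_{im} a_{jm} . \tag{2.15} $$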
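
To make the comparison between topological and weighted clustering concrete, the following sketch assumes the definition $c^w_i = \frac{1}{s_i (k_i - 1)} \sum_{j,m} \frac{w_{ij} + w_{im}}{2} a_{ij} a_{im} a_{jm}$, with the sum running over ordered pairs of neighbors of i. The function name and the small example graph are hypothetical.

```python
def clustering(weights, i):
    """Return (topological, weighted) clustering coefficient of node i."""
    nbrs = list(weights[i])
    k = len(nbrs)
    if k < 2:
        return 0.0, 0.0
    s = sum(weights[i].values())
    closed = 0        # ordered pairs of neighbors of i that are themselves linked
    weighted = 0.0    # the same pairs, each weighted by the mean of the two edges from i
    for j in nbrs:
        for m in nbrs:
            if j != m and m in weights[j]:
                closed += 1
                weighted += (weights[i][j] + weights[i][m]) / 2.0
    return closed / (k * (k - 1)), weighted / (s * (k - 1))

# Node "i" has three neighbors; only "j" and "h" are linked to each other,
# and the two edges from "i" closing that triangle carry the largest weights.
weights = {
    "i": {"j": 5.0, "h": 5.0, "m": 1.0},
    "j": {"i": 5.0, "h": 1.0},
    "h": {"i": 5.0, "j": 1.0},
    "m": {"i": 1.0},
}
c, c_w = clustering(weights, "i")
```

In this example the only triangle around the node is formed by its heaviest edges, so the weighted coefficient (10/22) exceeds the topological one (1/3).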



2.2. STATISTICAL MEASURES OF NETWORKS TOPOLOGY 




Figure 2.6: (A) Strength $s_i$ of the node i computed as the sum of the weights along the edges
connecting i with its neighbors. (B) The weighted clustering coefficient $c^w$ gives a measure of local
cohesiveness in presence of weights. Larger weights are responsible for larger weighted clustering. (C)
The weighted shortest path (red dashed links) can be topologically (number of links) longer than the
unweighted shortest path (grey dashed links). The weighted shortest path depends on the definition
of edge length, which in this case is $\ell_{ij} = 1/w_{ij}$.

The degree dependent functions $k^w_{nn}(k)$ and $c^w(k)$ can be directly compared with the unweighted
measures $k_{nn}(k)$ and $c(k)$, providing interesting information on the role of the weights. For the
clustering coefficient, the interpretation is particularly easy (Fig. 2.6-B): when the weighted clustering
coefficient is larger than the topological one, it means that triples are more likely formed by edges
with larger weights. A similar interpretation holds for the relation between $k^w_{nn}(k)$ and $k_{nn}(k)$.
In contrast with unweighted networks, when edges are weighted the neighborhood of a node is not
homogeneous, namely the edges outgoing from a given vertex can in principle carry very different
weights. In order to quantify the homogeneity of the neighborhood of a node i, we consider the
disparity measure [97, 84],

$$ Y_i = \sum_{j \in \mathcal{V}} \left( \frac{w_{ij}}{s_i} \right)^2 . \tag{2.16} $$

This quantity depends on the degree, in such a way that, when the weights are comparable, $Y(k) \sim 1/k$,
while when one edge dominates over the others $Y(k) \sim 1$.
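
The two limiting behaviors of the disparity are easy to verify numerically. The sketch below assumes the definition $Y_i = \sum_j (w_{ij}/s_i)^2$, with the sum running over the neighbors of i; the function name and the weight lists are illustrative.

```python
def disparity(edge_weights):
    """Disparity of a node, given the list of weights of its edges."""
    s = sum(edge_weights)
    return sum((w / s) ** 2 for w in edge_weights)

# Four comparable weights: the disparity equals 1/k with k = 4.
homogeneous = disparity([1.0, 1.0, 1.0, 1.0])

# One dominant edge: the disparity approaches 1 regardless of the degree.
dominated = disparity([1000.0, 1.0, 1.0, 1.0])
```

Plotting k Y(k) against k is therefore a convenient way to distinguish the two regimes: it stays close to 1 in the homogeneous case and grows linearly with k in the dominated one.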

Many other weighted quantities have been defined, but the most important for the topics discussed in 
the rest of this thesis are the weighted centrality measures and, in particular, the weighted betweenness 
centrality. 

Actually, it is sufficient to define the weighted shortest path and all centrality quantities can be directly
constructed from it (Fig. 2.6-C). To each edge (i,j) in the network, we associate a distance that is a
function of the weight, i.e. $\ell^w_{ij} = \ell^w(w_{ij})$, whose explicit form depends on the nature of the weights.
For instance, in the airports network weights are proportional to the traffic capacity, and a larger
traffic capacity leads to a better transmission along an edge, which from a "functional" point of view
corresponds to a decrease of the distance. Hence, we expect that the weighted distance along the edge
is inversely proportional to the weight, i.e. $\ell^w_{ij} \propto 1/w_{ij}$.
The definition of the node weighted betweenness centrality $b^w_i$ consists in replacing all shortest paths
with their weighted versions. For any two nodes h and j, the weighted shortest path between h and
j is the one for which the total sum of the lengths of the edges forming the path from h to j is
minimum, independently of the number of traversed edges. We denote by $\sigma^w_{hj}$ the total number of
weighted shortest paths from h to j and by $\sigma^w_{hj}(i)$ the number of them that pass through the vertex i
(with $h, j \neq i$); the weighted betweenness centrality (WBC) of the vertex i is then defined as

$$ b^w_i = \sum_{h \neq j \neq i} \frac{\sigma^w_{hj}(i)}{\sigma^w_{hj}} , \tag{2.17} $$

where the sum is over all the pairs with $j \neq h \neq i$. Similarly, we can define a weighted edge
betweenness. The algorithm proposed by Brandes in Ref. [53] can be easily extended to weighted
graphs, with no further increase in complexity.

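
The difference between hop distance and weighted distance can be illustrated with Dijkstra's algorithm. The sketch below assumes the edge length $\ell_{ij} = 1/w_{ij}$ and a target reachable from the source; the three-node graph and the function name are hypothetical examples.

```python
import heapq

def weighted_shortest_path(weights, source, target):
    """Dijkstra's algorithm with edge length 1 / w_ij; returns (length, path)."""
    dist = {source: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in weights[u].items():
            candidate = d + 1.0 / w
            if v not in dist or candidate < dist[v]:
                dist[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    # Walk back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# A weak direct link a-c competes with a strong two-hop route a-b-c.
weights = {
    "a": {"b": 2.0, "c": 0.5},
    "b": {"a": 2.0, "c": 2.0},
    "c": {"a": 0.5, "b": 2.0},
}
length, path = weighted_shortest_path(weights, "a", "c")
```

The weighted shortest path goes through b (total length 1) although the direct edge a-c is shorter in number of hops (length 2).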
Figure 2.7: Different granularity representations of the Internet: routers level (left) and autonomous
systems level (right). Routers are represented by blue nodes grouped in autonomous systems (circles).

2.3 Examples of real networks 

In this section, we review the phenomenological properties of two important real networks: the
Internet (in Section 2.3.1) and the World-wide Air-transportation Network (in Section 2.3.2). At the
same time it is an opportunity to show the practical use of the statistical measures defined in the
previous section. Some of these properties, such as heavy-tailed degree distributions or the small-world
property, are considered quite universal in complex networks; others are typical features of
the particular network under study. Though a detailed discussion is reserved only for those networks
that have been objects of direct investigation in this thesis, for the sake of completeness we provide
in Section 2.3.3 a brief overview of the phenomenology of some other real networks.

2.3.1 The Internet 

The Internet is a communication network in which the vertices are computers and the edges are the
physical connections among them. The existence of various types of vertices reflects the high level
of complexity and heterogeneity of the system: the hosts correspond to single-user computers; the
servers are computers or programs providing network services; the routers are computers devoted to
managing traffic and data exchange across the Internet. Similarly, hosts and routers are linked by various
types of connections, which are undirected and have different traffic capacity depending on their
bandwidth.

The structure of the Internet is the result of a complex interplay between growth and self-organization,
involving processes of cooperation and competition, without central administration or external control.
A good knowledge of the topological structure of the Internet is necessary to improve its functionality
and to protect the system from faults and traffic congestion. This is the main reason for the great
interest of researchers in studying the structure of the Internet.

An exhaustive introduction to the network properties of the Internet is provided by Refs. [203, 16];
we give here only some necessary information on its structure.

The first important observation is that it is impossible to keep track of all single hosts, which number
hundreds of millions all around the world and are organized in complex hierarchical structures, whose smallest
units are Local Area Networks, connected to the main net by means of routers. Monitoring the structure
of single local networks is thus too difficult, and partly useless since these networks are created
just to connect hosts inside buildings, university departments, corporate networks, city areas etc., and
their properties depend on very local administrative policies. Hence, the lowest level of granularity
at which we can analyze the Internet topology is the so-called router level (IR), that is a graph with
routers as vertices and the physical connections among them as edges.

Figure 2.8: Graph representations of the Internet at the autonomous systems level (left) and the
router level (right). Both graphs are based on data collected by the Internet's mapping projects at
CAIDA [62].

At a higher granularity level, the Internet can be partitioned into autonomously administered do- 
mains, called Autonomous Systems (AS). Within each autonomous system, whose structure is not 
defined on the basis of geographical proximity but more frequently on commercial agreements and 
policies, the traffic is handled following proper internal control strategies and restrictions, that can 
vary considerably from AS to AS. In order to better understand routing problems, it is very convenient 
to study the Internet topology at the level of the autonomous systems, considering each AS as a node 
and connections between different ASs as the links. 

The mapping projects of the Internet deal mainly with these two scales of description: the autonomous
systems level (AS) and the router level (IR). Fig. 2.7 reports a scheme of the structure of the Internet
at both levels. For many years, the structure of the Internet was considered similar to that
of a random graph, with a homogeneous degree distribution peaked around a characteristic degree
value. In the last decade, on the contrary, massive Internet measurements have provided evidence against
this kind of modeling and in favor of topologies with heterogeneous degree distributions. Historically,
the first experimental evidence of a power-law degree distribution at the AS level is contained in a
famous paper by Faloutsos et al. [120], who analyzed the data collected during the period 1997-1998
by the National Laboratory for Applied Network Research (NLANR) [196]. In the following years
many other studies, both at the AS and the router levels, confirmed this important discovery
[HSI IH71 I2T51 IT551 I5T51 1T571 12211 175]. Though the qualitative picture coming from these
two different scales is the same, the two graphs show relevant quantitative differences, which can be
examined in depth using the statistical tools introduced in the previous section. Note that the size
of the Internet is growing exponentially, but the major statistical measures do not change in time,
suggesting the idea that the Internet, as a physical system, is in a sort of non-equilibrium stationary
state. Two typical pictures of the Internet's AS and IR levels, obtained from the data of CAIDA's
mapping project, are displayed in Fig. 2.8.

Figure 2.9: Two examples of the degree distributions obtained from Internet mapping projects. $P(k)$
is clearly power-law at the AS level (left), while it is broad but slightly bent at the IR level (right).
Both graphs are based on data collected by the Internet's mapping projects at CAIDA [62].

Apart from the quantitative estimation of the exponent, which can possibly be affected by measurement
biases, the existence of heavy tails seems to be a solid feature of the Internet (see Chapter for a
complete statistical analysis of the Internet exploration technique). In Fig. 2.9 we report the degree
distributions for the AS (left) and IR (right) levels, which are clearly power-laws $P(k) \sim k^{-\gamma}$ (with
$\gamma \simeq 2.1$).

In both cases, the distribution of the shortest path length is peaked around the average distance $\langle \ell \rangle$,
whose very small value is a signature of the small-world property.
Differences between the two levels of description can be found by looking at the degree correlations. The
degree dependent spectrum of the average degree of nearest neighbors $k_{nn}(k)$ is roughly flat or slightly
increasing with k for IR maps, while it shows a very clear disassortative behavior at the AS level. The
reason for disassortative correlations can be found in the strong hierarchical structure of the Internet
at the autonomous system level, which is absent at the router level. This hierarchical organization of the
Internet at the AS level is also reflected in other measurements, for instance in its k-core structure,
which is discussed in Refs. [Ml E3 El El 13 EEHj (see also Section 2.2.5).

Both levels show a similar average clustering coefficient, which is considerably higher than in standard
random graphs, reinforcing the idea that the Internet is far from being locally tree-like. The spectrum
of the clustering coefficient is almost constant at the router level and clearly decreasing at the AS level.
The power-law functional dependence of $c(k)$ on the degree k has been interpreted as a signature of
the presence of modular structures at different scales [210].

Finally, the distribution of the node betweenness centrality $P_b(b)$ is clearly power-law for the AS level
($P_b(b) \sim b^{-\gamma_b}$, with $\gamma_b \simeq 2.0$), while it shows a very broad distribution with a more bent shape
at the IR level. The correlation between betweenness and degree is almost linear [203] (but large
fluctuations emerge using scatterplots instead of average values).

name         date     N       <k>      k_max  <k^2>/<k>  <l>   <c>
AS RV        2005/04  18119   3.54771  1382   2369.82    3.92  0.083
AS CAIDA     2005/04  8542    5.96851  1171   521.751    3.18  0.222
AS Dimes     2005/04  20455   6.03862  2800   1556.24    3.35  0.236
IR Mercator  2001     228297  2.79635  1314   36838.6    11.5  0.013
IR CAIDA     2003     192243  6.33085  841    8884.23    6.1   0.08
IP Dimes     2005     328011  8.2142   1453   10954      6.7   0.066

Table 2.1: Main characteristics of the Internet obtained by various mapping projects: number N of
vertices, average ($\langle k \rangle$) and maximal ($k_{max}$) degree, ratio $\langle k^2 \rangle / \langle k \rangle$, average shortest path $\langle \ell \rangle$ between
pairs of vertices, and average clustering coefficient $\langle c \rangle$. For the Autonomous Systems level we consider
maps of the Oregon route-views (RV) project EE], the skitter project at CAIDA and the DIMES
project |HHj; at the router level, data about recent maps by CAIDA and DIMES, together with older
results from the Mercator project [135], are reported.

This picture of the Internet, however, is correct only at a qualitative level. Quantitatively,
different mapping projects provide slightly different results for the average properties of the network,
in relation to the unequal node and edge coverage of the measurements. An idea of such a variety of
results is given by the list reported in Table 2.1.

The validation of Internet's large scale measurements is of primary interest to understand correctly 
topological and functional properties of the real Internet, and will be the subject of the next chapter. 
In summary, while the topological structure of the Internet at the router level is still hard to detect 
probably because of the unreliability of mapping processes, at higher granularity level, for the ASs, 
the Internet structure is much clearer and appears as a mixture of hierarchical modular structures with 
degree heterogeneity and small-world property. 

2.3.2 The World-wide Air-transportation Network 

The World-wide Air-transportation Network (WAN) is the network of airplane connections all around
the world, in which the vertices are the airports and the edges are non-stop direct flight connections
among them. As for the Internet, this is a physical network, in the sense that both the nodes and the
edges are well-defined objects. We can also easily define weights on the edges, since different flights
are characterized by a different number of passengers and by the geographical distance they have to
cover.

For the moment we do not consider geographical properties, which will be taken into account in Chapter
and define the weight $w_{ij}$ of the link (i,j) as the total number of available seats per year in
flights between airports i and j.

The analyzed dataset was collected by the International Air Transportation Association (IATA) for
the year 2002. It contains N = 3880 interconnected airports (vertices) and E = 18810 direct flight
connections (edges). This corresponds to an average degree $\langle k \rangle = 9.7$, with maximal value $k_c = 318$.
Degrees are strongly heterogeneously distributed, as confirmed by the shape of the degree distribution,
which can be described by the functional form $P(k) \sim k^{-\gamma} f(k/k_c)$, where $\gamma \simeq 2.0$ and $f(k/k_c)$
is an exponential cut-off which finds its origin in physical constraints on the maximum number of
connections that can be handled by a single airport [141, 140].

Figure 2.10: The strength distribution $P_s(s)$ (left) and the diagram strength vs. degree (right) for the
World-wide Airport Network. The diagram $s(k)$ shows the non-linear correlation between strength
and degree (black circles), whereas the same quantity grows linearly after weights randomization (red
squares). The inset reports the behavior of the degree distribution.
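
As a quick consistency check of the figures quoted for this dataset, the average degree of an undirected graph follows from the numbers of vertices and edges alone, since every edge contributes to the degree of two nodes:

$$ \langle k \rangle = \frac{2E}{N} = \frac{2 \times 18810}{3880} \simeq 9.7 . $$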

Moreover, the WAN shows the small-world property, since the average distance is $\langle \ell \rangle = 4.4$. It is worth
noting that weights from the IATA database are symmetrical, which is probably a consequence of the traffic
properties and allows us to consider the network as a symmetric undirected graph. The analysis of
weighted quantities reveals that both weights and strength are broadly distributed, and that they are
non-trivially correlated with the degree, since the average weight $\langle w_{ij} \rangle \sim (k_i k_j)^{\theta}$ with $\theta \simeq 0.5$ and
$s(k) \propto k^{\beta}$ with $\beta \simeq 1.5$ [23]. The sets of data in Figure 2.10, representing the strength distribution
$P(s)$ and the strength-degree diagram $s(k)$, are borrowed from Ref. [23]. Such a super-linear relation
points out that highly connected airports tend to collect more and more traffic, which could in principle
lead to a condensation process of the traffic on the hubs (i.e. a finite fraction of traffic handled by a
small number of airports).
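
An exponent such as $\beta$ is commonly estimated from a linear fit of log s versus log k. The sketch below assumes NumPy and uses synthetic data generated with a prescribed exponent; it is not the IATA dataset, and the function name is hypothetical.

```python
import numpy as np

def fit_exponent(k, s):
    """Least-squares fit of log s = beta * log k + log A; returns (beta, A)."""
    slope, intercept = np.polyfit(np.log(k), np.log(s), 1)
    return slope, np.exp(intercept)

# Synthetic strength-degree data following s(k) = 2 k^1.5 exactly.
k = np.array([1, 2, 4, 8, 16, 32], dtype=float)
s = 2.0 * k ** 1.5
beta, amplitude = fit_exponent(k, s)
```

On noisy empirical data the fit is usually performed on s(k) averaged over logarithmic bins of k, and the estimated exponent should be quoted with its uncertainty.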

The non-trivial role of the weights is also witnessed by degree-degree correlations and clustering, which
show a slightly different behavior if weights are considered (not shown). The topological average
nearest neighbors degree $k_{nn}(k)$, indeed, reaches a plateau for high degrees, while the weighted quantity
$k^w_{nn}(k)$ still increases, showing that even if the degrees of neighboring nodes are varied, the hubs are
preferentially connected with high traffic nodes. A similar interpretation also holds for the observed
values of the weighted clustering coefficient $c^w(k)$ for high degree nodes, which is larger than
the topological one.

2.3.3 Other Examples 

It is possible to distinguish three main types of real complex networks: 

• artificial infrastructures and technological networks; 

• social networks generated by interactions between individuals; 

• networks existing in nature, such as food webs or biological networks. 

The Internet and the WAN are the most popular examples of the first class, which contains many other
communication and transportation networks [30, 127, 247, 2*, 55, 222]. Not all of them show a broad
degree distribution, but a considerable amount of heterogeneity can be recovered at a traffic level, as
recently shown by De Montis et al. in the case of the Sardinian transportation network.
The panorama of biological networks is very wide, and its analysis goes beyond the purposes of this
thesis. For instance, in cellular networks, the nodes are genes or proteins and the links are metabolic
fluxes regulating cellular activity. In a seminal work [152], Kauffman showed that chemical processes
can be conveniently represented by chemical reaction networks. The vertices are substrates, connected
by the chemical reactions in which they take part. The orientation of the corresponding edge says
whether a substrate is involved in the reaction as product or "educt". The average size of these networks
is quite small ($N \sim O(10^3)$), but despite the small size, the degree distribution is fairly broad.
These networks are small worlds and their structure seems to be rather robust under random
defects, errors and mutations.

Another class of biological networks are protein-protein interaction networks (PIN) (see for instance
Refs. [15H, 239, 204, 148]), in which the edges identify the existence of interactions between two
proteins. They present heterogeneous degree distribution and low average distance, but also non-trivial
pair correlations, which are related to the functions of the proteins. Indeed, by means of modularity
analyses [210, 178], it is possible to uncover the relation between some small topological structures
(like triangles or small cliques) and some functional modules of proteins.
Other interesting natural networks are food webs [251, 109, 206], i.e. networks of animals in a given
ecosystem, in which directed arrows establish "who eats whom" in the food chain. In this case
self-links must also be considered, as a consequence of the presence of cannibalism in many animal species.
However, as we will show in another context in Chapter the problem of counting all species of an
ecosystem makes the definition of these networks quite difficult. Similar difficulties are faced in the
definition of trophic links between species. In addition, these networks are very small ($N \sim O(10^2)$)
and their degree distributions are not clearly fat-tailed [110]. The clustering coefficient is, instead,
very large, showing that they are far from being random graphs, even if they present features that
are typical of small-world models.

Another large field of application of the network description is the
social sciences. Social networks are used to represent social interactions among individuals (called
actors), such as acquaintances, collaborations, sexual relations, etc. The typical structure of social
networks is that of multi-partite networks, with a set of nodes representing actors and other sets
of nodes representing the affiliations they belong to (Fig. 2.11). Actors are indirectly linked together
by means of common affiliations, i.e. the places they frequent, the office in which they work, etc. A
standard technique to study social networks consists of projecting multi-partite networks on unipartite
representations, in which nodes of a unique type (the actors) are linked by an edge if they have at
least one common affiliation. Interesting examples of such networks are co-authorship networks, the
most popular one being the network of collaborations among physicists submitting manuscripts to the
well-known archive called "cond-mat" [187]. This database of condensed-matter physics contains
N = 12722 scientists who authored e-prints during the period from 1995 to 1998. According to
the unipartite description, the presence of an edge between two authors means that they have
co-authored at least one paper. Obviously, the link between authors who have co-authored many works
together should be stronger, and we can assign weights to the edges following the definition proposed
by Newman [T52|.



The "cond-mat" archive is a public archive of condensed matter preprints and e-prints, see 'http://www.arxiv.org/archive/cond-mat'.

Figure 2.11: (Top) Example of bipartite representation of a social network: blue nodes with numbers
are the affiliations, and white nodes with letters are the actors. (Bottom) The figure represents the
unipartite projection of the network, in which pairs of actors are connected by links if in the original
network they have at least one common affiliation.

where $\delta_i^p$ is 1 if the author $i$ has contributed to the paper $p$ and 0 otherwise, and $n_p$ is the number
of authors of the paper $p$. Both the degree distribution and the strength distribution are broad, but
the weights are linearly correlated with the degree, revealing that their distribution is independent of
the topology. The network of "cond-mat" shows, moreover, an interesting community structure, with
many induced subgraphs of different sizes, corresponding to different research fields jlffifl II lj.
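The unipartite projection with Newman's weights can be illustrated with a minimal sketch. The function names `newman_weights` and `strengths`, and the representation of the bipartite data as a list of author lists, are assumptions made for the example; they are not taken from the analysis of the cond-mat database itself.

```python
from collections import defaultdict
from itertools import combinations


def newman_weights(papers):
    """Project a paper-author bipartite network on the authors.

    `papers` is an iterable of author lists (hypothetical input format).
    Returns {(i, j): w_ij} with i < j, where each paper p shared by i and j
    contributes 1 / (n_p - 1), n_p being the number of authors of p.
    """
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n_p = len(authors)
        if n_p < 2:
            continue  # a single-author paper creates no link
        for i, j in combinations(authors, 2):
            weights[(i, j)] += 1.0 / (n_p - 1)
    return dict(weights)


def strengths(weights):
    """Strength of each author: sum of the weights of its links."""
    s = defaultdict(float)
    for (i, j), w in weights.items():
        s[i] += w
        s[j] += w
    return dict(s)


if __name__ == "__main__":
    toy = [["A", "B"], ["A", "B", "C"], ["C"]]
    w = newman_weights(toy)
    print(w)
    print(strengths(w))
```

With this normalization the strength of an author equals the number of papers written with at least one co-author, which is one way to see why the factor $n_p - 1$ is a natural choice.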
A further level of information that is sometimes available in co-authorship networks is the
number of citations gained by a paper. In such a case the weights are defined à la Newman but
including the number of citations [49]. A very interesting issue is the study of the temporal evolution
of co-authorship networks, identifying those topological and weighted structures that were reinforced over
the years and those that were lost. This kind of analysis has been carried out in Ref. [49], focusing
on the study of the temporal evolution of the impact of co-author groups in a particular scientific
community (the database analyzed was the InfoVis Contest dataset, a network of 614 papers by 1036
authors, published between 1974 and 2004).

2.4 Networks modeling

In the previous section, we have reviewed some examples of real networks, from which we conclude
that a networked description can be applied to a variety of systems with a large number of interacting
units, independently of their function and role in nature or society. This makes evident the lack
of a unique underlying theoretical framework in which all network properties may be analyzed and
interpreted.

In order to build such a theoretical framework starting from phenomenological data, the first step
consists of network modeling. We can roughly divide networks into two classes: static networks and
evolving networks. In the first class, the overall statistical properties are fixed and single networks are
generated using static algorithms. A typical example of this class is the static random graph ensemble
defined by Erdős and Rényi [115, 116]. Classical random graph models are generated by drawing edges
uniformly at random between pairs of vertices with a fixed probability. The resulting graphs have
Poissonian degree distributions (Erdős-Rényi Model). Recently, this ensemble of graphs has been
extended in order to include graphs with any possible form of the degree distribution (Configuration
Model) [180, 181, 55, 2*, TT]. In this section, we introduce these two models together with another
static model, which was proposed by Watts and Strogatz as a toy model able to reproduce the main
properties of real small-world networks (Watts-Strogatz Model) [247].

Evolving networks, on the contrary, are a very recent topic, which has attracted most attention after
the discovery of broad degree distributions in real growing networks such as the Internet, and of the
possibility of producing power-law degree distributions by means of very simple evolution rules. In
this case, the generation algorithm of the network implies a non-equilibrium process in which the
statistical quantities evolve in time. However, all these growing models are built in such a way that
the infinite size (and thus time) limit gives well-defined statistical properties. In the following, we
will describe only two models of growing networks with power-law degree distribution, the famous
Barabási-Albert Model [17] and its clustered version proposed by Dorogovtsev, Mendes, and Samukhin
(DMS Model) [106].

Though the division into static and evolving networks is important from the point of view of the
description of real networks, for the purposes of this thesis, i.e. for the study of
dynamical processes taking place on networks, a better classification is that of considering the two
following classes: homogeneous networks, with degree distribution peaked around the average value;
and heterogeneous networks, in which the degree distribution is skewed and may present heavy tails,
or more generally, large fluctuations around the average value.

Many real networks evolve in time, but, in fact, the temporal scale of their topological rearrangements
is usually much longer than the time scale of the dynamical processes occurring on the network. This
property allows one to study dynamical processes on networks that have been obtained with a growing
mechanism as if they were static models.

2.4.1 Homogeneous Networks 

Figure 2.12: Typical degree distribution (left) for a homogeneous Poissonian random graph (right).

Erdős-Rényi Model (ER) - As already mentioned, the theory of random graphs was founded
by P. Erdős and A. Rényi [115, 116], who defined two random graph ensembles $\mathcal{G}_{N,E}$ and $\mathcal{G}_{N,p}$, in
which the fixed quantities are, respectively, the number of nodes and edges in the former case and the
number of nodes and the linking probability in the latter. In $\mathcal{G}_{N,E}$, the graphs are defined by choosing
uniformly at random pairs of nodes and connecting them with an edge, until the number of edges
equals $E$. The second case is more convenient in practice, and is defined as follows. Starting from $N$
nodes, one connects each pair of nodes with probability $p$. At the end of the procedure, the average
number of edges is $E = pN(N-1)/2$ and the average degree is $\langle k \rangle = 2E/N = p(N-1) \simeq pN$, if $N$ is
sufficiently large. For a finite number of nodes, the probability that a node has degree $k$ is given by a
binomial law

$$P(k) = C^{k}_{N-1}\, p^k (1-p)^{N-1-k}, \tag{2.19}$$

where $p^k (1-p)^{N-1-k}$ is the probability of having exactly $k$ edges and $C^{k}_{N-1} = \frac{(N-1)!}{k!\,(N-1-k)!}$ is the
number of possible ways these edges can be selected. Taking the limit $N \to \infty$ and $p \to 0$, in such a
way that $pN \to \langle k \rangle = \text{const}$, the binomial law tends to a Poissonian distribution

$$P(k) \simeq \frac{\langle k \rangle^k}{k!}\, e^{-\langle k \rangle}. \tag{2.20}$$

Poissonian distributions are very peaked around the average value, with bounded fluctuations; indeed,
the second moment is finite, $\langle k^2 \rangle = \langle k \rangle^2 + \langle k \rangle$. An example of such a degree distribution is displayed in
Fig. 2.12 (left), while the right panel in the same figure displays a sketch of its graphical representation.
Since the probability of finding nodes of degree much larger than $\langle k \rangle$ decreases exponentially, these
graphs are prototypes for homogeneous networks.
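The $\mathcal{G}_{N,p}$ construction and the Poissonian limit of Eq. (2.20) can be checked numerically with a minimal sketch. The function names, the adjacency-set representation and the chosen values of $N$ and $\langle k \rangle$ are assumptions made for illustration only.

```python
import math
import random


def erdos_renyi(n, p, rng):
    """One sample of G(N, p): each pair of nodes is linked with probability p."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def degree_histogram(adj):
    """Empirical degree distribution {k: fraction of nodes with degree k}."""
    counts = {}
    for nbrs in adj:
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return {k: c / len(adj) for k, c in counts.items()}


def poisson_pmf(k, mean):
    """Poissonian law of Eq. (2.20)."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


if __name__ == "__main__":
    n, mean_k = 1000, 4.0
    graph = erdos_renyi(n, mean_k / (n - 1), random.Random(0))
    for k, freq in sorted(degree_histogram(graph).items()):
        print(k, round(freq, 4), round(poisson_pmf(k, mean_k), 4))
```

The double loop over pairs costs $O(N^2)$ operations, which is acceptable for a demonstration but not for very large graphs.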

However, ER random graphs show a giant component only for $p > p_c$ with $p_c = 1/N$, which corresponds
to the critical average connectivity $\langle k \rangle = 1$. Erdős and Rényi ([115, 116]) proved the existence of a
transition between a phase, for $p < p_c$, in which with probability 1 the graph has no component of
size larger than $O(\log N)$, and a phase, for $p > p_c$, in which the graph has a giant component of order
$O(N)$. In the marginal case $p = p_c$, the largest component has size $O(N^{2/3})$. This second order phase
transition belongs to the same universality class as the mean-field percolation transitions [48].
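The two phases can be observed directly by measuring the largest component of a single sample below and above $\langle k \rangle = 1$. The following is only a sketch: the union-find bookkeeping, the function name and the sample sizes are choices made for the example, and a single realization is of course not a proof of the scaling.

```python
import random


def largest_component(n, p, rng):
    """Size of the largest connected component of one G(N, p) sample."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                ri, rj = find(i), find(j)
                if ri != rj:
                    if size[ri] < size[rj]:
                        ri, rj = rj, ri
                    parent[rj] = ri  # merge the smaller tree into the larger
                    size[ri] += size[rj]
    return max(size[find(v)] for v in range(n))


if __name__ == "__main__":
    n = 1500
    rng = random.Random(1)
    print("<k> = 0.5:", largest_component(n, 0.5 / n, rng))
    print("<k> = 2.0:", largest_component(n, 2.0 / n, rng))
```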
The ER random graphs are completely uncorrelated, thus the average degree of the nearest neighbors
is a constant independent of $k$. The clustering coefficient is simply the probability that any two
neighbors of a given vertex are also connected to each other, i.e.

$$\langle c \rangle = p = \frac{\langle k \rangle}{N}. \tag{2.21}$$

The fact that in random graphs the clustering coefficient vanishes in the limit of large size $N$ justifies
the local tree-like approximation (Bethe Tree) used to obtain many relevant results.

For instance, the tree-like approximation allows one to compute how the diameter of the graph scales
with $N$. Indeed, since each node has typically $\langle k \rangle$ neighbors, at distance $\ell$ in a tree-like topology, the
number of visited nodes scales as $\langle k \rangle^{\ell}$. The diameter is reached when the number of visited nodes is
equal to $N$, but the shortest path distribution is very peaked, thus we can approximate the diameter
with the average distance, $\langle \ell \rangle$, obtaining

$$\langle \ell \rangle \sim \frac{\log N}{\log \langle k \rangle}. \tag{2.22}$$

Watts-Strogatz Model (WS) - In Section 2.3 we have shown experimental data from which
it emerges that the average hop distance between two vertices in real complex networks is very small,
and it is possible to reach every vertex in a small number of steps. Nevertheless, random graphs are
not optimal models for the study of real networks, since most of them are clustered, i.e. they
contain a lot of triangles, whereas random graphs are locally tree-like. In order to overcome this
problem, Watts and Strogatz proposed a simple model interpolating between a regular lattice with
large average distance but strong clustering and the random graph with small diameter and small
clustering [247].

The initial network is a one-dimensional $m$-banded graph, i.e. a ring of $N$ sites in which each vertex
is connected to its $2m$ nearest neighbors. The vertices are then visited one after the other: each
link connecting a vertex to one of its $m$ nearest neighbors in the clockwise sense is left in place with
probability $1 - p$, and with probability $p$ is rewired to a randomly chosen other vertex. The long-range
connections introduced play the role of shortcuts connecting regions that are very distant in the
original network. Figure 2.13 displays a sketch of the rewiring mechanism.
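The rewiring procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the new endpoint is drawn uniformly among the vertices that would create neither a self-link nor a double link, and the links are visited vertex by vertex as in the text.

```python
import random


def watts_strogatz(n, m, p, rng):
    """Ring of n sites, each linked to its 2m nearest neighbors, then rewired.

    Each of the m clockwise links of every vertex is kept with probability
    1 - p and, with probability p, its far end is moved to a random vertex.
    """
    if n <= 2 * m:
        raise ValueError("need n > 2m so that the 2m ring neighbors are distinct")
    adj = [set() for _ in range(n)]
    for i in range(n):
        for d in range(1, m + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    for i in range(n):
        for d in range(1, m + 1):
            if rng.random() >= p:
                continue  # link left in place
            j = (i + d) % n
            # Assumption: forbid self-links and multiple links when rewiring.
            candidates = [v for v in range(n) if v != i and v not in adj[i]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[i].remove(j)
            adj[j].remove(i)
            adj[i].add(new)
            adj[new].add(i)
    return adj


if __name__ == "__main__":
    g = watts_strogatz(200, 3, 0.1, random.Random(0))
    print(min(len(nb) for nb in g), max(len(nb) for nb in g))
```

Since only the far end of each clockwise link is moved, the number of links is conserved and every vertex keeps the $m$ links it "owns".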

For $p = 1$, the network is completely randomized but, since every node keeps degree at least $m$, it is not
equivalent to a random graph. The interesting regime is for $1/N \ll p \ll 1$, in which a still rather
large clustering coefficient coexists with a logarithmically scaling average distance. In the limit $p \to 0$,
the small-world property disappears and the metric structure of the lattice is restored. It has been
shown ([29, 32, 19f, 194]) that the transition occurs precisely at $p = 0$, and in the infinite size limit the
average distance diverges as $\langle \ell \rangle \sim 1/p$. This cross-over phenomenon for increasing rewiring probability
plays an important role in determining the behavior of dynamical processes defined on this type of
network (see Chapter 5 for the case of the Naming Game on Watts-Strogatz small-world networks). The
degree distribution of this model in the regime $1/N \ll p \ll 1$ has the form

$$P(k) = \sum_{i=0}^{\min(k-m,\,m)} \binom{m}{i} (1-p)^{i}\, p^{m-i}\, \frac{(pm)^{k-m-i}}{(k-m-i)!}\, e^{-pm} \tag{2.23}$$

for $k \geq m$, and is equal to zero for $k < m$.

The clustering coefficient can be easily computed recalling that two neighbors that are linked together
in the original model remain connected with probability $(1-p)^3$,

$$\langle c \rangle \simeq \frac{3(m-1)}{2(2m-1)}\, (1-p)^3. \tag{2.24}$$

Figure 2.13: Rewiring process in the Watts-Strogatz model of small-world network. Increasing the
rewiring probability $p$ (randomness increases from left to right), the network passes from a regular
2-banded network (left) to a small-world network (center) and finally to a random graph (right).
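The degree distribution of the Watts-Strogatz model is the convolution of a binomial law (clockwise links of the neighbors that were not rewired away) with a Poissonian law of mean $pm$ (shortcuts received). A short numerical sketch can be used to verify that the expression is normalized and that its mean is $2m$; the function names are assumptions made for the example, and `math.comb` requires Python 3.8 or later.

```python
import math


def ws_degree_distribution(k, m, p):
    """Degree distribution of the Watts-Strogatz model, Eq. (2.23)."""
    if k < m:
        return 0.0
    total = 0.0
    for i in range(0, min(k - m, m) + 1):
        # i of the m incoming ring links survived the rewiring ...
        kept = math.comb(m, i) * (1 - p) ** i * p ** (m - i)
        # ... and k - m - i shortcuts were received from elsewhere.
        n_short = k - m - i
        received = (p * m) ** n_short / math.factorial(n_short) * math.exp(-p * m)
        total += kept * received
    return total


def ws_clustering(m, p):
    """Approximate clustering coefficient of Eq. (2.24)."""
    return 3 * (m - 1) / (2 * (2 * m - 1)) * (1 - p) ** 3


if __name__ == "__main__":
    m, p = 3, 0.2
    probs = [ws_degree_distribution(k, m, p) for k in range(80)]
    print("normalization:", sum(probs))
    print("mean degree:", sum(k * q for k, q in enumerate(probs)))
    print("clustering:", ws_clustering(m, p))
```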

2.4.2 Heterogeneous Networks 

In recent years, a huge amount of experimental data has yielded unquestionable evidence that real networks
present a strong degree heterogeneity, expressed by a broad degree distribution. In order to reproduce
the main features of this new class of heterogeneous networks, a big effort has been devoted to network
modeling, and a large number of models with degree heterogeneity have been put forward. The main
feature of these networks is that the average degree is not representative of the distribution, and the
second moment $\langle k^2 \rangle$ is very large, possibly diverging in the infinite size limit. A characterization of the
heterogeneity level of a degree distribution is given by the parameter $\langle k^2 \rangle / \langle k \rangle$, which is strictly related
to the expression of the normalized fluctuations and enters in the description of many dynamical
phenomena on networks, such as percolation and epidemic spreading.

When the degree distribution is a power law $P(k) \sim k^{-\gamma}$ ($2 < \gamma < 3$), the fluctuations diverge with
$N$ (the average remaining finite), and nodes with very large degree appear in the network. However,
not all heterogeneous networks are power-law: many of them possess bent degree distributions that
cannot be classified as power laws. A broad distribution that is not scale-free, but has been used
to fit a number of data coming from Internet measurements ([57]), is the Weibull (WEI), whose
form is $P(k) = (a/c)(k/c)^{a-1} \exp(-(k/c)^{a})$, with $a$, $c$ real positive constants. Weibull distributions
are good candidates as degree distributions also for networks of scientific collaborations, wordwebs
and biological networks, where the existence of a neat power law is still under debate.
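The divergence of the fluctuations for $2 < \gamma < 3$ can be made explicit with a short estimate. The sketch below assumes a continuous-degree approximation, a lower cutoff $m$, an upper cutoff $k_{\max}$ and a normalization constant $C$; none of these symbols is fixed by the text above.

```latex
% Assumptions: continuous k, P(k) = C k^{-\gamma} for m <= k <= k_max, 2 < \gamma < 3.
\begin{align}
\langle k \rangle &\simeq \int_m^{k_{\max}} C\, k^{1-\gamma}\, dk
  = \frac{C}{\gamma-2}\left(m^{2-\gamma} - k_{\max}^{2-\gamma}\right)
  \;\longrightarrow\; \frac{C\, m^{2-\gamma}}{\gamma-2}
  \quad (k_{\max}\to\infty), \\
\langle k^2 \rangle &\simeq \int_m^{k_{\max}} C\, k^{2-\gamma}\, dk
  = \frac{C}{3-\gamma}\left(k_{\max}^{3-\gamma} - m^{3-\gamma}\right)
  \sim k_{\max}^{3-\gamma}.
\end{align}
```

The average thus stays finite while the second moment grows with the cutoff. If one further assumes the commonly used natural cutoff $k_{\max} \sim N^{1/(\gamma-1)}$, the heterogeneity parameter scales as $\langle k^2 \rangle / \langle k \rangle \sim N^{(3-\gamma)/(\gamma-1)}$.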
Now, we introduce the most relevant models of heterogeneous networks, that will be used in the 
numerical simulations related to the investigations reported in the next chapters. 

Configuration Model (CM) - This is a static model of scale-free graphs that generalizes the
random graph ensemble of Erdős and Rényi to generic degree distributions. It is particularly useful
to study dynamics on network models with a given degree distribution and controlled correlations. A
famous algorithm to generate generalized random graphs was proposed by Molloy and Reed [180, 181].
A degree sequence $\{k_i\}$ ($i = 1, \dots, N$) is drawn randomly from the desired degree distribution $P(k)$
and assigned to the $N$ nodes of the network, with the additional constraint that the sum $\sum_i k_i$ must
be even. At this point, the vertices are connected by $\sum_i k_i / 2$ edges, respecting the assigned degrees
and avoiding self- and multiple-connections. Figure 2.14 reports a sketch of the generation procedure.
The last condition introduces unexpected correlations, producing a slightly disassortative behavior.
In order to eliminate correlations, Catanzaro et al. [71] have proposed a variation of the model
characterized by a cut-off at $\sqrt{N}$ for the possible degree values. In the rest of this work, when talking
of the configuration model we will always refer to this particular model, which is known as the Uncorrelated
Configuration Model (UCM) and presents flat $k_{nn}(k)$ and $c(k)$.

Figure 2.14: Illustration of the Molloy-Reed algorithm used to generate a generalized random graph
starting from a degree sequence. The stubs (left) are linked in such a way that self-links and multi-links
are avoided.
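The stub-matching procedure with the $\sqrt{N}$ cut-off can be sketched as follows. This is an illustration under stated assumptions, not the exact algorithm of the cited works: the function names are hypothetical, invalid pairs of stubs are simply redrawn, and the procedure stops (possibly leaving a few stubs unmatched) after `max_failures` consecutive rejections.

```python
import random


def power_law_degrees(n, gamma, k_min, rng):
    """Draw n degrees from P(k) ~ k^-gamma on k_min..floor(sqrt(n)), even sum."""
    k_max = int(n ** 0.5)  # structural cut-off of the UCM
    values = list(range(k_min, k_max + 1))
    weights = [k ** (-gamma) for k in values]
    while True:
        degrees = rng.choices(values, weights=weights, k=n)
        if sum(degrees) % 2 == 0:
            return degrees


def configuration_model(degrees, rng, max_failures=1000):
    """Pair stubs at random, rejecting self-links and multiple links.

    Returns (edges, leftover) where leftover is the number of unmatched stubs.
    """
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    edges = set()
    failures = 0
    while len(stubs) >= 2 and failures < max_failures:
        a, b = rng.sample(range(len(stubs)), 2)
        u, v = stubs[a], stubs[b]
        edge = (min(u, v), max(u, v))
        if u == v or edge in edges:
            failures += 1
            continue
        edges.add(edge)
        failures = 0
        # Remove the two used stubs in O(1): overwrite with the last one, pop.
        for idx in sorted((a, b), reverse=True):
            stubs[idx] = stubs[-1]
            stubs.pop()
    return edges, len(stubs)


if __name__ == "__main__":
    rng = random.Random(3)
    degs = power_law_degrees(1000, 2.5, 2, rng)
    edges, leftover = configuration_model(degs, rng)
    print(len(edges), "edges,", leftover, "unmatched stubs")
```

Redrawing rejected pairs is a pragmatic shortcut: it does not sample the ensemble of simple graphs exactly uniformly, which is acceptable for a sketch but should be kept in mind when correlations are measured.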

Barabási-Albert Model (BA) - The first attempt to model real growing networks such as
the Web was provided by Barabási and Albert, who proposed the idea of preferential attachment as
the central ingredient in order to get a power-law degree distribution [17]. Preferential attachment
is based on the simple idea that, during the network's evolution, newly arriving nodes become
preferentially connected with nodes that already have a large number of connections. It was proposed
as a "construction recipe" for the Web, in which new pages acquire more visibility if they link to very
important webpages, but it can be assumed as a valuable principle for a large number of technological
and social networks, in which nodes want to optimize their conditions by connecting with very important
and central nodes. During the growth, the "rich get richer" effect is produced: large degree nodes
increase their degree more easily than low degree nodes.

The algorithm starts from a small fully connected core of $m_0$ nodes (their precise properties do
not change the statistical properties of the model in the large size limit). At each time step
$t = 1, 2, \dots, N - m_0$, a new node $j$ enters the network and forms $m \leq m_0$ edges with distinct existing
nodes: the probability that an edge is created between $j$ and the node $i$ is

$$\Pi_{j \to i} = \frac{k_i}{\sum_l k_l}. \tag{2.25}$$

Every new node has $m$ links and the network size at time $t$ is $N(t) = t + m_0$; since the number of links
is $E = mt$, in the large time limit the average degree is simply $\langle k \rangle = 2m$. The degree distribution
of the BA model can be obtained by means of different methods (mean-field approximation [17, T%],
rate equation [162], or master equation [105]), and shows, in the limit $t \to \infty$, a power law behavior
$P(k) \sim k^{-\gamma}$ with $\gamma = 3$. Figure 2.15 displays the degree distribution and a graphical representation
of a BA network.

Figure 2.15: Typical degree distribution (left) for a heterogeneous BA model (right).
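A common way to implement the linear kernel of preferential attachment is to keep a list in which every node appears once per link end, so that a uniform draw from the list selects node $i$ with probability $k_i / \sum_l k_l$. The sketch below follows this idea; the function names are hypothetical, and the $m$ distinct targets are obtained by redrawing duplicates, which slightly perturbs the kernel and is an assumption of this example.

```python
import random


def barabasi_albert(n, m, m0, rng):
    """Grow a network of n nodes from a fully connected core of m0 nodes."""
    if not (1 <= m <= m0 and m0 >= 2 and n >= m0):
        raise ValueError("need 1 <= m <= m0, m0 >= 2 and n >= m0")
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Every node appears in `ends` once per link, i.e. k_i times.
    ends = [v for e in edges for v in e]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))  # duplicates are simply redrawn
        for t in sorted(targets):
            edges.append((t, new))
            ends.extend((t, new))
    return edges


def degrees(edges, n):
    """Degree of each node from the edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


if __name__ == "__main__":
    net = barabasi_albert(2000, 3, 4, random.Random(0))
    deg = degrees(net, 2000)
    print("average degree:", sum(deg) / len(deg), "max degree:", max(deg))
```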

Apart from the very interesting idea of preferential attachment, the Barabási-Albert model is a very
peculiar network, with flat degree correlations and almost vanishing clustering. Many variations of
the model have been proposed, including node aging [157], fitness [4T, 117], edge rewiring [3], limited
information [186], etc.; in particular, the addition of a constant $A$ representing the initial attractivity
($k_i \to k_i + A$ in the kernel of Eq. (2.25)) allows one to generate power law networks with desired exponent
$2 < \gamma < \infty$ [105]. Note that the linearity (in $k$) of the attaching kernel in Eq. (2.25) is a necessary
condition to get a power-law distribution. It has been indeed proved ([162, 1fifi]) that using generalized
kernels of the type $\Pi_{j \to i} = k_i^{p} / \sum_l k_l^{p}$ the degrees of the emerging network are power-law distributed only
for $p = 1$. When $p < 1$, the degree distribution is exponentially shaped; when $p > 1$ the evolution
produces edge-condensation on few vertices.

Dorogovtsev-Mendes-Samukhin Model (DMS) - Another interesting variation of the BA
model is the growing clustered network proposed by Dorogovtsev, Mendes and Samukhin (DMS)
[106], in which the evolving rule consists in attaching new vertices to the extremities of a randomly
chosen edge. The probability to choose a given vertex is thus proportional to the number of edges of
which it is an extremity, i.e. to its degree. While the DMS model generates a power law distribution
($\gamma = 3$), the growing rule does not explicitly need any a priori knowledge of the degrees of the nodes. The only
real difference between this model and the BA model is in the clustering coefficient, which in the DMS
model is very high ($\langle c \rangle \simeq 0.739$ for $m = 1$). The general behavior of the clustering spectrum $c(k)$
and the average value $\langle c \rangle$ for the DMS model has been computed using the rate equation formalism
in Ref. [2E1
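The growth rule of the DMS model for $m = 1$ needs only a uniform choice among the existing edges, which makes it very short to implement. The sketch below starts from a single edge (an assumption; the initial condition is not specified above) and also measures the average clustering coefficient, so that the high value quoted in the text can be checked on a finite sample. The function names are hypothetical.

```python
import random


def dms_network(n, rng):
    """Each new vertex attaches to both ends of a uniformly chosen edge."""
    edges = [(0, 1)]  # assumed initial condition: a single edge
    for new in range(2, n):
        u, v = rng.choice(edges)
        edges.append((u, new))
        edges.append((v, new))
    return edges


def average_clustering(edges, n):
    """Mean over nodes of the fraction of linked pairs of neighbors."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0.0
    for node in range(n):
        nbrs = list(adj[node])
        k = len(nbrs)
        if k < 2:
            continue  # clustering is taken as 0 for degree < 2
        links = sum(
            1
            for a in range(k)
            for b in range(a + 1, k)
            if nbrs[b] in adj[nbrs[a]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / n


if __name__ == "__main__":
    net = dms_network(2000, random.Random(0))
    print("average clustering:", average_clustering(net, 2000))
```

No degree is ever read by `dms_network`: the preferential attachment emerges from the fact that a vertex of degree $k$ is an extremity of $k$ edges.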

2.4.3 Weighted Networks 

In this last paragraph, we discuss a model of growing weighted network that is particularly appropriate
for the description of the World-wide Air-transportation Network (WAN).

Figure 2.16: Weights redistribution rule in the BBV model. When a new node $j$ enters the network
attaching to a node $i$, it carries new traffic that is redistributed to the neighbors of $i$ by means of a
contribution to their weights. The new edge assumes a fixed weight $w_0$.

In real weighted networks, the weights are not fixed but evolve in time together with the topology, 
so that the characteristics of the networks depend on the interplay of these two types of dynamics 
(topological and weighted) and on the relation between their time-scales. As for the purely topological 
models, also for weighted evolving networks there is a variety of slightly different models, each one 
pinpointing a particular aspect of the evolution. We are interested in the situation in which the 
evolutions of links and weights are coupled and have the same temporal scale, that is similar to what 
happens in the real network of airports. 

In the case of the WAN, in fact, each new airport $j$ that enters the existing network brings new flights,
and new passengers are introduced into the system. A fraction of these passengers will not stop at the
destinations of the direct flights they have taken from $j$ (e.g. a new airport is connected only to New
York but part of the passengers would like to go to other cities, such as Philadelphia or Chicago);
hence, part of the traffic is immediately locally redirected on the network. Moreover, new airports try
to connect themselves to the most connected hubs, in order to guarantee many connecting flights to the
passengers. Consequently, a good way to naively model the growth of a network like the WAN is that
of considering 1) a preferential attachment procedure based on the strength and 2) a redistribution
rule for the local traffic brought by the new node. More precisely, the Barrat-Barthélemy-Vespignani
(BBV) model j^lEHlGS] starts from an initial clique of $m_0$ nodes with a fixed weight $w_0$. At each
time step, a new vertex $j$ is added to the network with $m$ edges (of weight $w_0$) that are randomly
attached to a previously existing vertex $i$ according to the probability

$$\Pi_{j \to i} = \frac{s_i}{\sum_l s_l}.$$

Immediately after the creation of the new link, the weight of the links $(i, l)$ connecting $i$ with each
other neighbor $l$ is locally redistributed according to the rule (see Fig. 2.16)

$$w_{il} \to w_{il} + \delta\, \frac{w_{il}}{s_i},$$

i.e. the increase of the traffic $\delta$ is locally distributed among the neighboring connections, each link
receiving a fraction of traffic that is proportional to the amount of traffic already handled by that
connection.
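The coupled evolution of topology and weights can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the function name and data layout are hypothetical, the $m$ targets of a new node are forced to be distinct by redrawing duplicates, and the attachment probability is taken proportional to the strength $s_i$, with each existing link $(i, l)$ gaining $\delta\, w_{il} / s_i$ when $i$ receives a new link.

```python
import random


def bbv_network(n, m, delta, rng, m0=3, w0=1.0):
    """Strength-driven growth with local redistribution of the weights.

    Returns (w, s): w[i] maps each neighbor l of i to the weight w_il and
    s[i] is the strength of node i (sum of the weights of its links).
    """
    if not (1 <= m <= m0 <= n):
        raise ValueError("need 1 <= m <= m0 <= n")
    w = [dict() for _ in range(n)]
    s = [0.0] * n
    for i in range(m0):  # initial clique with links of weight w0
        for j in range(i + 1, m0):
            w[i][j] = w[j][i] = w0
            s[i] += w0
            s[j] += w0
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            # Node i is chosen with probability s_i / sum_l s_l.
            targets.add(rng.choices(range(new), weights=s[:new])[0])
        for i in sorted(targets):
            s_i = s[i]
            for l in w[i]:
                # Each link of i gains a share of delta proportional to w_il.
                dw = delta * w[i][l] / s_i
                w[i][l] += dw
                w[l][i] += dw
                s[l] += dw
            s[i] += delta
            w[i][new] = w[new][i] = w0
            s[i] += w0
            s[new] += w0
    return w, s


if __name__ == "__main__":
    weights, strength = bbv_network(500, 2, 1.0, random.Random(0))
    degree = [len(nb) for nb in weights]
    print("max degree:", max(degree), "max strength:", round(max(strength), 2))
```

Each attachment increases the strength of the chosen node by $w_0 + \delta$, which is the feedback that couples the growth of the topology to that of the traffic.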

This network model displays power law distributions of degree, strength and weights. An interesting
feature of the model is the presence of non-linear correlations between degree and strength. However,
it is worth noting that the model fails to reproduce the large fluctuations that characterize the
quantity $s(k)$ in the case of the WAN (and other growing weighted networks), and that are clearly
visible in the scatter plot of degree vs. strength.

In a recent publication [2U, the authors of the model have shown that large fluctuations may be due
to the existence of spatial constraints coming from the embedding of the nodes in a two-dimensional
euclidean space. This modified model consists in considering the nodes deployed on a two-dimensional
euclidean space: each new node is situated in a randomly chosen point of the lattice, and the
preferential attachment kernel is modified in order to account for the fact that a node prefers to connect to
nodes that are well-connected but also spatially close to itself,

where $d_{ij}$ is the euclidean distance between nodes $i$ and $j$, and $\eta$ is related to a characteristic length
scale. When $\eta$ is small enough that spatial effects cannot be avoided, the relation between degree
and strength is still non-linear but now presents very large fluctuations (around the average value
expressed by $s(k)$).

This model will be further investigated in the analysis of the vulnerability of weighted networks
(Chapter 0J), in order to explain some results obtained for the real airport network.

Chapter 3

Exploration of complex networks 



3.1 Introduction 

The present chapter is devoted to describing and studying the exploration techniques of complex networks.
Motivations of this research and a general introduction to the problem, in which we highlight several
different sampling methods, are provided in Sections 3.1.1 and 3.1.2. Then, in Section 3.2, we focus on
a theoretical model for the exploration of the Internet, which is analyzed using a typical mean-field
statistical physics approach. A variety of different measures are introduced to investigate the main
properties of the exploration process and its biases. Finally, exploiting an interesting application of
non-parametric statistics, we propose an approach to compensate for the biases (Section 3.3).

3.1.1 Motivations 

Network modeling is undoubtedly the favorite tool used by researchers to understand the origins of
the ubiquity of complex networks in the real world. In particular, the presence, in biological as well as
technological systems, of the same peculiar topological properties, such as a broad degree distribution
and very small average inter-vertex distances, is very intriguing. By means of network models, some
of the phenomenological results have been reproduced, and possible explanations for many observed
properties have been put forward. An aspect that, on the contrary, has been relatively disregarded is
the validation of phenomenological data, and the identification of possible errors or biases that occurred
during the process of data collection and analysis. The importance of this issue resides in the fact
that "systematic" errors in the statistics, due to sampling biases, could compromise the reliability of
the data and of the observed properties of real networks.

The idea that the sampling of networks may introduce biases is far from being unrealistic, as proved
by several examples reported in Section 3.1.2 and coming from different fields of complex networks
research.

In social and biological networks, the limited information about the exact mechanisms generating the
original network and the amount of arbitrariness in the definition of the edges (e.g. relations among
individuals, interactions between proteins, etc.) make sampling processes extremely problematic,
since we do not have real control over the origins of possible biases.

Completely different is the case of the physical Internet, in which nodes and edges are well-defined,
but the dynamical nature of its structure and the lack of any centralized control have favored a
self-organized evolution of the system, without any information on its topological properties. In practice,
we do not have a complete knowledge of a router's neighbors, since routing decisions depend on optimized
traffic protocols by means of which data packets should be sent along the shortest path available to the
destination. This means that routers only know which are the neighbors belonging to the shortest
paths; i.e., they could ignore the existence of other neighbors. Actually, traffic congestion and local
policies can force routers to deliver packets through some preferential neighbors, causing small path
inflations with respect to the shortest one.

Internet explorations, obtained by means of tree-like probes based on traceroute processes, exploit
the routing protocols in order to trace a path between different nodes of the network. In this way,
they suffer from important biases due to the loss of lateral connectivity, i.e. of those nodes or links which
do not lie on the shortest path between two nodes (or on its small perturbations). We will see that
traceroute explorations can seriously misrepresent the degree distribution of the original network.

On the other hand, a good knowledge of the Internet topology is fundamental in order to improve
its performance, minimize traffic congestion and protect the system against malicious attacks. For
this reason, the study of Internet sampling biases is of primary interest, not only for the scientific
community but also for practitioners and common users.
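The loss of lateral connectivity can be seen already on a four-node example. The sketch below is a caricature of a single-source probe: it assumes that the probe discovers exactly one shortest-path tree rooted at the source (built here by a breadth-first search with ties broken by node label), which is an idealization and not a description of the real traceroute tool. The function name and the toy graph are hypothetical.

```python
from collections import deque


def bfs_tree_edges(adj, source):
    """Edges of one shortest-path tree rooted at `source`."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in sorted(adj[u]):  # deterministic tie-breaking by label
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return {(min(u, v), max(u, v)) for v, u in parent.items() if u is not None}


def all_edges(adj):
    """Set of undirected edges of the graph."""
    return {(min(u, v), max(u, v)) for u in adj for v in adj[u]}


if __name__ == "__main__":
    # Toy graph: two triangles sharing the link (1, 2).
    graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
    seen = bfs_tree_edges(graph, 0)
    print("observed:", sorted(seen))
    print("missed:", sorted(all_edges(graph) - seen))
```

In this example node 2 has degree 3 in the original graph but appears with degree 1 in the sampled tree, because the links (1, 2) and (2, 3) lie on no selected shortest path from the source.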

The investigation has to be carried on at different levels; our theoretical formulation of the problem 
is aimed at 

• understanding what is the origin of the exploration biases and at what stage they affect the 
observed properties, 

• identifying which kind of topologies yield the most accurate sampling, 

• providing some "rules of thumb" for the optimization of mapping strategies, 

• obtaining alternative approaches able to correct the biases, at least in some simple cases. 

In the next section, we introduce the issue of sampling biases in complex networks, highlighting how
dramatically a bad sampling of the network can distort the shape of the observed degree
distribution.

3.1.2 Networks sampling methods and their biases 

There are many possible sources of sampling biases in complex networks, depending on the field 
of research and the type of sampling method used in the experiments. Here we provide a short 
survey of examples, focusing in particular on the mechanisms of homogeneous sampling and tree-like 
explorations and on the dramatic effects they can have in misrepresenting the degree distribution of 
the underlying network. 

Figure 3.1: Illustration of a node-picking sampling algorithm used to model sampling of biological
and social networks. The original network (top) is sampled picking up nodes at random and retaining
common links (red nodes in the bottom figure).

Sampling biological and social networks - In the context of social and biological networks,
practitioners have developed many types of experiments in order to gain information on the topology
of the networks of interest. Apart from the complexity of the experimental setup necessary for such
experiments, a deep conceptual problem emerges: ties between nodes are usually associated with
relations or interactions, thus they can have different nature or intensity, which are difficult to evaluate
and take into account correctly. In social networks, this is due to the level of arbitrariness in defining
relations between actors, and in biological networks to the fact that measures are usually indirect
and may be influenced by the effects of external unknown variables. Nevertheless, many authors
have modeled the collection of data in social as well as in biological experiments by means of
node- or edge-picking sampling algorithms. The two algorithms give similar results, thus let us focus on
the node-picking case. Fig. 3.1 illustrates a node-picking sampling process. In the absence of further
information on the way a node is selected, we can assume a homogeneous sampling, i.e. a node is
included in the sampled subnet with fixed probability $p$, and left out with probability $1 - p$; only
edges between sampled nodes are retained. Then, if $P(k)$ and $\tilde{P}(k)$ are respectively the original and
sampled degree distributions, they are related by a "poissonian filter",



Using generating functions formalism |233l Ilfi8| . it is easy to show that, when p ~ 1, the deviation 
from the original distribution is negligible both for homogeneous and heterogeneous degree distribu- 
tions. When the sampling probability is low (p <C 1), homogeneous distributions are conserved, even 
if the average connectivity is reduced by a factor p, whereas in the case of power-law distributions the 
observed exponent may be different from the real one. Subnets have more nodes with relatively few 
connections, due to the sampling process, but very large degree nodes are usually well-represented, 
so that the original power-law behavior is recovered for k 3> 1. Consequently, the degree distribution 
appears slightly concave in the middle (in log-log scale), introducing biases in the measurement of the 
exponent (it is systematically reduced) . Actually, a good strategy is that of looking at the tail of the 
distribution in which the sampling is systematically more accurate. 
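The homogeneous node-picking procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the adjacency-list representation (a dict mapping each node to a list of neighbors) and the function names `node_picking_sample` and `mean_degree` are assumptions made here, not part of the original study.

```python
import random


def node_picking_sample(adj, p, rng=None):
    """Homogeneous node-picking: keep each node independently with
    probability p and retain only edges between two kept nodes."""
    rng = rng or random.Random()
    kept = {v for v in adj if rng.random() < p}
    return {v: [w for w in adj[v] if w in kept] for v in kept}


def mean_degree(adj):
    """Average degree of a graph stored as an adjacency dict."""
    if not adj:
        return 0.0
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


if __name__ == "__main__":
    # Hypothetical test graph: ring lattice in which every node has degree 4.
    n = 2000
    ring = {i: [(i + d) % n for d in (-2, -1, 1, 2)] for i in range(n)}
    sub = node_picking_sample(ring, 0.3, random.Random(1))
    # The mean degree of the sampled subnet is expected to be close to p * 4.
    print(mean_degree(ring), mean_degree(sub))
```

Each neighbor of a retained node survives independently with probability p, which is why the sampled degree of a node with original degree k0 follows a binomial law with parameters k0 and p.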

Unfortunately, sampling by means of real experiments is far from being uniform, since external fac- 
tors can be correlated to the properties of some nodes, favoring their sampling. For instance, social 
networks are usually bipartite, i.e. actors (individuals) are linked together via multiple interaction 
contexts or affiliations. Following Ref. [158], together with the random exclusion of actors or
affiliations, there are two other principal mechanisms causing missing data: an actor's unpredictable
decision not to respond to a particular survey, and the tendency to give fixed or preferential choices in the answers.
A whole field of social network analysis is involved in studying how to predict the correlations among
such missing-data events (see Refs. [136, 215] and in particular Ref. [245] and references therein). We
prefer to consider a simpler example coming from biology and reported in Ref. [205], which shows how
hidden variables influencing the sampling process can have dramatic effects on the results.
Let us consider a protein-protein interaction network (PIN), in which the nodes are proteins and the 
edges are the interactions between them. The standard methods to detect interactions are two-hybrid 
assays and mass spectrometry [205]. Both of them are sensitive to the physical conditions in which
the experiment is performed (e.g. the temperature, the solubility degree of proteins, etc). The de- 
tectability of an interaction can be affected by these external variables, whose effect on the process 
is not completely known. Similarly, in neural networks, and cell regulatory networks, some nodes or 
edges may be ignored in the experiments only because their functions are temporarily inhibited by 
the activation of other functions. 

In order to model the sampling, let us consider a network with a homogeneous degree distribution, for
instance an Erdős-Rényi random graph, and assign to each node i a variable $x_i$, drawn from a probability
distribution $\rho(x)$. Such a hidden variable is also known in the literature as "fitness" [63, 130, 40, 117].
Now we prune the graph, keeping the edge (i, j) with probability $q(x_i, x_j)$. In a biological framework,
$x_i$ is for instance the free energy gain of a protein from being solvated during the experiment. The
interaction takes place only if the free energy loss $x_c$ in breaking a bond is compensated by the gain
$x_i + x_j$ of staying together in the solution. On the other hand, it is reasonable to assume an exponential
distribution for the free energy, i.e. $\rho(x) \propto \exp(-x)$. The average degree k(x) of a sampled node
as a function of its free energy is readily computed as

$$k(x) = pN \int_0^\infty dx'\, \rho(x')\, q(x, x'), \tag{3.2}$$

where pN comes from the approximation that all nodes have about pN neighbors, and $q(x, x') =
\theta(x + x' - x_c)$, with $\theta$ the Heaviside step function. The integral in Eq. 3.2 yields $k(x) \propto Np \exp(x)$, which,
inserted into the probability relation $P(k)\,dk = \rho(x)\,dx$, provides a power-law expression for the degree
distribution of the sampled network, $P(k) \sim k^{-2}$. This striking result shows that, as a consequence of sampling biases, one may
observe heterogeneous degree distributions even when the underlying network is homogeneous. In
their work [205], Petermann and De Los Rios also show that even when the original network has
power-law distributed connectivity, the exponent can be considerably underestimated.
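The step from the exponential fitness distribution to the $k^{-2}$ tail can be written out explicitly. The sketch below assumes $x < x_c$, a fitness density $e^{-x}$ on $x \ge 0$ (written $\rho(x)$ here), and the step-function detection rule $q(x, x') = \theta(x + x' - x_c)$; the prefactors are kept only for illustration.

```latex
\begin{aligned}
k(x) &= pN \int_0^\infty dx'\, e^{-x'}\, \theta(x + x' - x_c)
      = pN \int_{x_c - x}^\infty dx'\, e^{-x'}
      = pN\, e^{x - x_c}, \\
\frac{dk}{dx} &= k(x), \\
P(k) &= \rho(x)\, \left| \frac{dx}{dk} \right|
      = \frac{e^{-x}}{k}
      = \frac{pN\, e^{-x_c}}{k^2} \;\propto\; k^{-2}.
\end{aligned}
```

The last equality uses $e^{-x} = pN e^{-x_c} / k$, obtained by inverting the first line.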

Sampling technological networks - As already stressed in this section, the case of tech- 
nological networks is completely different, since we do not have any idea of the topology of the real 
graph, but the sampling methods are based on very well-defined probing processes, that can be mod- 
eled using tree-like explorations. 

The first example of this class of networks is the World Wide Web, which is usually explored efficiently
using so-called "crawling processes" [200]. The WWW, indeed, possesses the remarkable property that
the links outgoing from a page are directly visible, thus we can apply snowball sampling methods, that
are related to well-known processes like epidemic spreading and percolation (see Chapter 0}. A single 
node is first chosen together with its outgoing links and the nodes connected to them. Then, new
nodes connected to those picked in the last step are selected. The process continues recursively until
the desired number of nodes has been gathered. The main limitation of using this method on the Web is
the huge size of the network itself, which makes it difficult to reach all remote regions. In general, in each
layer only a fraction of the nodes is sampled, and this may introduce some inaccuracies [236].
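A layer-by-layer snowball exploration of the kind described above might look as follows. This is a minimal sketch under stated assumptions: the graph is an adjacency dict with integer node labels, `fraction` is a hypothetical parameter giving the probability that each visible link is actually followed, and the stopping rule is a simple cap on the number of gathered nodes.

```python
import random


def snowball_sample(adj, root, max_nodes, fraction=1.0, rng=None):
    """Snowball sampling from `root`, one layer at a time.

    Each link leaving a node of the current layer is followed with
    probability `fraction`; exploration stops once `max_nodes` nodes have
    been gathered or no new node can be reached.
    Returns (sampled_nodes, sampled_edges)."""
    rng = rng or random.Random()
    sampled = {root}
    edges = set()
    layer = [root]
    while layer and len(sampled) < max_nodes:
        next_layer = []
        for v in layer:
            for w in adj[v]:
                if rng.random() >= fraction:
                    continue  # this link is not followed
                if w not in sampled:
                    if len(sampled) >= max_nodes:
                        continue  # node budget exhausted
                    sampled.add(w)
                    next_layer.append(w)
                edges.add((min(v, w), max(v, w)))
        layer = next_layer
    return sampled, edges
```

With `fraction=1.0` the procedure reduces to a plain breadth-first exploration truncated at `max_nodes`; lower values mimic the partial coverage of each layer mentioned in the text.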

Snowball-like samplings do not work on the Internet, since routing protocols redirect probes along
preferential (shortest) paths, thus preventing exploration algorithms from getting complete knowledge
of a node's neighborhood. The common sampling strategy consists in acquiring local views of the
network from several vantage points, merging these views in order to get a presumably accurate global
map. Local views are obtained by evaluating a certain number of paths to different destinations by
using specific tools such as traceroute-like commands or by the analysis of BGP tables. At first
approximation these processes amount to the collection of shortest paths from a source node to a set of
target nodes, obtaining a partial spanning tree of the network. The merging of several of these views
provides the map of the Internet from which the statistical properties of the network are evaluated.
According to this description, discovering the Internet topology is more than a simple network sampling
problem: it is a genuine dynamical exploration process.

Figure 3.2: Illustration of the snowball sampling technique. The original graph (left) is sampled using
a snowball algorithm starting from a root vertex (black node on the right part of the figure). The
first layer is composed of a fraction of its neighboring nodes (dark grey nodes), that are used as starting
nodes to explore the second layer (bright grey nodes), etc.

The first contribution to the problem of sampling biases in the Internet was given by Lakhina et
al. [163], who showed that traceroute-like explorations can seriously affect the estimation of degree
distributions. In particular, when the number of sources and destinations is small, one can observe
power-law-like distributions even in the sampling of Erdős-Rényi random graphs, whose original degree
distribution is poissonian. Since the first data showing heavy-tailed distributions for the Internet
topology were collected by gathering traceroute paths from a very limited number of sources and
destinations [199], they concluded that the Internet maps could be wrong, or at least that the node degree
distribution is not a sufficiently robust metric to characterize the Internet's topology. Nevertheless, the
same exploration performed on a network with a power-law degree distribution behaves very differently,
since a handful of sources is sufficient to yield a sampled graph whose degree distribution
looks very similar to the original one. The authors of Ref. [139] have moreover numerically monitored
the observed degree of sampled nodes as a function of their real degree, finding that it is
systematically underestimated.

The analytical foundation for the numerical work of Lakhina et al. [163] was provided in Refs. [78, 1],
in which the authors modeled traceroute explorations as single-source, all-destinations, shortest-path
trees. Using breadth-first search spanning trees, they rigorously proved that, for an Erdős-Rényi
random graph with average degree $\langle k \rangle$, the connectivity distribution of the obtained spanning tree
displays a power-law behavior $k^{-1}$, with an exponential cut-off setting in at a characteristic degree
$k_c \sim \langle k \rangle$. We give here a non-rigorous derivation of this result using differential equations.
A typical breadth-first search algorithm is the following. There are three types of nodes: explored,
untouched, and pending. All edges are initially labeled invisible. The final observed network will be composed
only of explored vertices and visible edges. The process starts with the root vertex labeled pending
and placed in a queue; all the others are untouched. Vertices are chosen from the queue in first-in-first-out
order, thus at the beginning the root is popped out. All the untouched neighbors of the vertex chosen
from the queue (which thereby becomes explored) are appended to the queue as pending vertices. The edges going from
the explored vertex to these appended neighbors are made visible. Let u(t) and s(t) be the densities
of untouched and pending vertices respectively; the process can then be described by the following
system of differential equations [78]:

$$\frac{du}{dt} = -\langle k \rangle\, u(t), \qquad \frac{ds}{dt} = \langle k \rangle\, u(t) - 1. \tag{3.3}$$

Using the initial conditions u(0) = 1 and s(0) = 0, we get a solution of the form $u(t) = e^{-\langle k \rangle t}$ and
$s(t) = 1 - t - e^{-\langle k \rangle t}$. If a node is chosen at time t, its observed degree is the number of previously
untouched neighbors plus one, given by the edge we used to reach it. Since a node can be discovered
at any time from t = 0 to $t = t_0$ (the smallest positive root of s(t) = 0), we get the degree distribution by
means of the following temporal average, i.e.

$$P^*(k+1) = \frac{1}{t_0} \int_0^{t_0} dt\, \frac{e^{-\langle k \rangle u(t)} \left( \langle k \rangle u(t) \right)^k}{k!}, \tag{3.4}$$

where we used the fact that the real distribution is poissonian and the untouched nodes at time t are
homogeneously sampled from it with density u(t). After computing the integral and making some
approximations it is easy to show that the observed degree distribution behaves as $P^*(k) \propto 1/k$ up to a degree
$k \sim \langle k \rangle$, where a cut-off sets in.
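The breadth-first exploration with explored, untouched and pending vertices can be sketched as follows. This is an illustrative sketch, not the code used in the cited works: the adjacency-dict representation, the simple O(n^2) random graph generator `erdos_renyi` and the function name `bfs_tree_degrees` are assumptions introduced here.

```python
import random
from collections import deque


def erdos_renyi(n, mean_degree, rng):
    """Erdos-Renyi graph with n nodes and link probability mean_degree/(n-1)."""
    p = mean_degree / (n - 1)
    adj = {v: [] for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def bfs_tree_degrees(adj, root):
    """Breadth-first exploration from `root`.

    Returns the observed degree of every reached vertex, i.e. its degree in
    the tree made of the visible edges only."""
    state = {v: "untouched" for v in adj}
    observed = {root: 0}
    state[root] = "pending"
    queue = deque([root])
    while queue:
        v = queue.popleft()          # first-in-first-out order
        state[v] = "explored"
        for w in adj[v]:
            if state[w] == "untouched":
                state[w] = "pending"
                queue.append(w)
                observed[v] += 1     # edge v-w becomes visible
                observed[w] = 1      # the edge used to reach w
    return observed
```

Collecting a histogram of the returned values for a large Erdős-Rényi graph gives a numerical estimate of the observed degree distribution of the spanning tree, which can be compared with the $1/k$ behavior discussed above.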

From such an analysis one could conclude that the observation of heavy tails in the Internet's degree
distribution is an artifact of tree-like explorations; nonetheless, this result is strictly correct
only in the case of single-source probing, as will become clearer in the following.
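For the single-source case, the $1/k$ scaling can be recovered with a change of variable in the temporal average of the Poisson distribution. The sketch below assumes $u(t) = e^{-\langle k \rangle t}$, a large average degree (so that $t_0 \simeq 1$), and introduces the auxiliary variable $z = \langle k \rangle u(t)$, which is not part of the original notation.

```latex
\begin{aligned}
P^*(k+1) &= \frac{1}{t_0} \int_0^{t_0} dt\,
            \frac{e^{-\langle k \rangle u(t)} \left( \langle k \rangle u(t) \right)^k}{k!},
            \qquad z = \langle k \rangle e^{-\langle k \rangle t},
            \quad dz = -\langle k \rangle\, z\, dt, \\
         &= \frac{1}{t_0 \langle k \rangle\, k!}
            \int_{z_0}^{\langle k \rangle} dz\, z^{k-1} e^{-z},
            \qquad z_0 = \langle k \rangle e^{-\langle k \rangle t_0} \simeq 0, \\
         &\simeq \frac{\Gamma(k)}{t_0 \langle k \rangle\, k!}
            = \frac{1}{t_0 \langle k \rangle\, k}
            \qquad \text{for } 1 \le k \ll \langle k \rangle .
\end{aligned}
```

For $k \gtrsim \langle k \rangle$ the integrand, peaked around $z \simeq k - 1$, falls outside the integration range and the incomplete gamma integral decays rapidly, which is the origin of the cut-off at $k \sim \langle k \rangle$.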

All these results stress the importance of determining to what extent the topological properties
observed in sampled graphs are representative of those of the real networks. We have tried to address
this issue in the case of traceroute-like explorations using methods of statistical physics. The main
results of this study, which led to the publications in Refs. [87, 86, 88], are illustrated in the next
sections.

Figure 3.3: Illustration of the traceroute-like procedure. Shortest paths between the set of sources
and the set of destination targets are discovered (red full lines) while other edges are not found (dashed
black lines). Note that not all shortest paths are found since the "Unique Shortest Path" procedure
is used.

3.2 Statistical physics approach to traceroute explorations 

This section gives a formal statistical description of traceroute-like processes in terms of a simple 
model that provides a qualitative and quantitative understanding of the properties observed in real 
experiments. 

3.2.1 The model 

In a typical exploration, a set of active sources deployed in the network sends traceroute probes 
to a set of destination nodes. Each probe collects information on all the nodes and edges traversed 
along the path connecting the source to the destination [HI] . By merging the information collected on 
each path it is then possible to reconstruct a partial map of the network (Fig. 3.3). More precisely,
the set of edges and nodes discovered by each probe depends on the "path selection criterion" (p.s.c.)
used to choose the path between a pair of nodes. In the real Internet, many factors, including
commercial agreements, traffic congestion and administrative routing policies, contribute to determining
the actual path, causing it to differ even considerably from the shortest path. Despite these local, often
unpredictable path distortions or inflations, a reasonable first approximation of the route traversed 
by traceroute-like probes is the shortest path between the two nodes. This assumption, however, is 
not sufficient for a proper definition of a traceroute model in that equivalent shortest paths between 
two nodes may exist. In the presence of a degeneracy of shortest paths we must therefore specify the 
path selection criterion by providing a resolution algorithm for the selection of shortest paths. 
For the sake of simplicity we can define three selection mechanisms among equivalent paths that may 
account for some of the features encountered in the Internet discovery: 

• Unique Shortest Path (USP) probe. In this case the shortest path route selected between a node
i and the destination target T is always the same, independently of the source S (the path being
initially chosen at random among all the equivalent ones).

• Random Shortest Path (RSP) probe. The shortest path between any source-destination pair is
chosen randomly among the set of equivalent shortest paths. This might mimic different peering
agreements that make the paths between pairs of nodes independent.

• All Shortest Paths (ASP) probe. The selection criterion discovers all the equivalent shortest 
paths between source-destination pairs. This might happen in the case of probing repeated 
in time (long time exploration), so that back-up paths and equivalent paths are discovered in 
different runs. 

We will generically call $\mathcal{M}$-path the path found using one of these "metrics" or path selection criteria.
Actual traceroute probes contain a mixture of the three mechanisms defined above, even if many
effective heuristic strategies are commonly applied to improve the reliability and the performance of
the sampling. An example of such heuristics is the interface resolution algorithm called iffinder,
proposed by Broido and Claffy [57]. A router can have more than one interface with the
external world, so paths passing through different interfaces might lead one to erroneously count
two interfaces as two independent routers. Algorithms such as iffinder make it possible to avoid this type of
error.

As remarked by Guillaume and Latapy [139], the different path selection criteria may influence
the general picture emerging from the theoretical model, but the USP procedure clearly represents
the worst scenario among the three different methods, yielding the minimum number of discoveries.
For this reason, we will focus only on the USP data. The interest of this analysis lies precisely in
the choice of working in the most pessimistic case, being aware that path inflations should actually
provide a more pervasive sampling of the real network.

Formally, the traceroute model is the following. Let $G = (V, \mathcal{E})$ be a sparse undirected graph
with vertices (nodes) $V = \{1, 2, \dots, N\}$ and edges (links) $\mathcal{E}$. Then let us define the sets of vertices
$S = \{i_1, i_2, \dots, i_{N_S}\}$ and $T = \{j_1, j_2, \dots, j_{N_T}\}$ specifying the random placement of $N_S$ sources and
$N_T$ destination targets. For each ensemble of source-target pairs $\Omega = \{S, T\}$, we compute with our
p.s.c. the paths connecting each source-target pair. The sampled graph $G^* = (V^*, \mathcal{E}^*)$ is defined as the
set of vertices $V^*$ (with $N^* = |V^*|$) and edges $\mathcal{E}^*$ (with $E^* = |\mathcal{E}^*|$) induced by considering the union of all
the $\mathcal{M}$-paths connecting the source-target pairs. The sampled graph should thus be analogous to the
maps obtained from real traceroute sampling of the Internet.
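The formal definition above, combined with the Unique Shortest Path criterion, can be illustrated with a short simulation. This is a sketch under stated assumptions: the graph is an adjacency dict with integer labels, the function names `usp_next_hops` and `traceroute_sample` are hypothetical, and the USP rule is implemented by fixing, for every target, one randomly chosen next hop per node among the equivalent shortest-path choices.

```python
import random
from collections import deque


def usp_next_hops(adj, target, rng):
    """For one target, fix a unique shortest path from every node by choosing,
    once and for all, a random neighbor that is one step closer to the target."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    next_hop = {}
    for v, d in dist.items():
        if v != target:
            closer = [w for w in adj[v] if dist.get(w) == d - 1]
            next_hop[v] = rng.choice(closer)
    return next_hop


def traceroute_sample(adj, sources, targets, rng=None):
    """Union of the USP paths between all source-target pairs.
    Sources and targets are always part of the sampled vertex set."""
    rng = rng or random.Random()
    nodes = set(sources) | set(targets)
    edges = set()
    for t in targets:
        next_hop = usp_next_hops(adj, t, rng)
        for s in sources:
            if s != t and s not in next_hop:
                continue  # target not reachable from this source
            v = s
            while v != t:
                w = next_hop[v]
                edges.add((min(v, w), max(v, w)))
                nodes.add(w)
                v = w
    return nodes, edges
```

Because the next hop toward a given target is fixed per node, two probes that meet at an intermediate node follow the same route from there on, which is the defining feature of the USP criterion and the reason it yields the smallest number of discoveries.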

In the next section, we provide a mean-field analysis of the discovery process as a function of the densities
$\rho_T = N_T/N$ and $\rho_S = N_S/N$ of targets and sources. In general, traceroute-driven studies run from
a relatively small number of sources to a much larger set of destinations. For this reason, in many
cases it is appropriate to work with the density of targets $\rho_T$ while still considering $N_S$ instead of the
corresponding density. This combination of the parameters allows us to compare mapping processes
on networks with different sizes. An appropriate quantity representing the level of sampling of the
networks is the probing effort $\epsilon = N_S N_T / N$, which measures the density of probes imposed on the system.
In real situations it represents the density of traceroute probes in the network and therefore a
measure of the load placed on the network by the measuring infrastructure.

3.2.2 Mean-field analysis 

Here, we provide a statistical estimate for the probability of edge and node detection as a function
of $N_S$, $N_T$ and the topology of the underlying graph. The method is based on a simple mean-field
statistical analysis of the simulated traceroute mapping.
For each set $\Omega = \{S, T\}$ we define the quantities

$$\sum_{t=1}^{N_T} \delta_{i,j_t} = \begin{cases} 1 & \text{if vertex } i \text{ is a target;} \\ 0 & \text{otherwise,} \end{cases} \tag{3.5}$$

$$\sum_{s=1}^{N_S} \delta_{i,i_s} = \begin{cases} 1 & \text{if vertex } i \text{ is a source;} \\ 0 & \text{otherwise,} \end{cases} \tag{3.6}$$

where $\delta_{i,j}$ is the Kronecker symbol. These quantities tell us if any given node i belongs to the set of
sources or targets, and obey the sum rules $\sum_i \sum_{t=1}^{N_T} \delta_{i,j_t} = N_T$ and $\sum_i \sum_{s=1}^{N_S} \delta_{i,i_s} = N_S$. Analogously,
we define the quantity $\sigma_{i,j}^{(l,m)}$ that takes the value 1 if the edge (i, j) belongs to the selected path
between nodes l and m, and 0 otherwise.

For a given set of sources and targets $\Omega$, the indicator function that a given edge (i, j) will be discovered
and belongs to the sampled graph is simply $\pi_{i,j} = 1$ if the edge (i, j) belongs to at least one of the
$\mathcal{M}$-paths connecting the source-target pairs, and 0 otherwise. We can obtain an exact expression for
$\pi_{i,j}$ by noting that $1 - \pi_{i,j}$ is 1 if and only if (i, j) does not belong to any of the paths between sources
and targets, i.e. if and only if $\sigma_{i,j}^{(l,m)} = 0$ for all $(l, m) \in \Omega$. This leads to

$$\pi_{i,j} = 1 - \prod_{l \ne m} \left( 1 - \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t}\, \sigma_{i,j}^{(l,m)} \right). \tag{3.7}$$

Since we are working at a purely statistical level, in order to get more useful expressions we perform
the average over all possible realizations of the set $\Omega = \{S, T\}$. By definition we have that

$$\left\langle \sum_{t=1}^{N_T} \delta_{i,j_t} \right\rangle_\Omega = \rho_T \quad \text{and} \quad \left\langle \sum_{s=1}^{N_S} \delta_{i,i_s} \right\rangle_\Omega = \rho_S, \tag{3.8}$$






where $\langle \cdots \rangle_\Omega$ identifies the average over all possible deployments of sources and targets $\Omega$. These
equalities simply state that each node i has, on average, a probability to be a source or a target
that is proportional to their respective densities. In real processes there are correlations among the
paths, due to the relative position of sources and targets, thus in general the average is a complicated
quantity, that we cannot easily compute analytically. A first approximation can be obtained by making
an uncorrelation assumption that yields an explicit expression for the discovery probability. The
assumption consists in computing the discovery probability neglecting the correlations among different
paths originated by the position of sources and targets. While this assumption does not provide an
exact treatment of the problem, it generally conveys a qualitative understanding of the statistical
properties of the system. In this approximation, the average discovery probability of an edge is

$$\langle \pi_{i,j} \rangle_\Omega = 1 - \left\langle \prod_{l \ne m} \left( 1 - \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t}\, \sigma_{i,j}^{(l,m)} \right) \right\rangle_\Omega \simeq 1 - \prod_{l \ne m} \left( 1 - \rho_T \rho_S \left\langle \sigma_{i,j}^{(l,m)} \right\rangle \right), \tag{3.9}$$

where in the last term we take advantage of neglecting correlations by replacing the average of the
product of variables with the product of the averages and using Eq. 3.8. This expression simply states
that each possible source-target pair weighs in the average with the product of the probabilities that
the end nodes are a source and a target. The discovery probability is thus obtained by considering the
edge in an average effective medium (mean-field) of sources and targets homogeneously distributed in the
network. The realization average $\langle \sigma_{i,j}^{(l,m)} \rangle$ is very simple in the uncorrelated picture, depending
only on the kind of probing model. In the case of the ASP probing, $\langle \sigma_{i,j}^{(l,m)} \rangle$ is just 1 if (i, j)
belongs to one of the shortest paths between l and m, and 0 otherwise. In the case of the USP
and the RSP, on the contrary, only one path among all the equivalent ones is selected. If we denote
by $\sigma^{(l,m)}$ the number of shortest paths between vertices l and m, and by $x_{i,j}^{(l,m)}$ the number of these
paths passing through the edge (i, j), the probability that the traceroute model chooses a path going
through the edge (i, j) between l and m is $\langle \sigma_{i,j}^{(l,m)} \rangle = x_{i,j}^{(l,m)} / \sigma^{(l,m)}$.

The standard situation we consider is the one in which $\rho_T \rho_S \ll 1$, and since $\langle \sigma_{i,j}^{(l,m)} \rangle \le 1$ we have

$$\prod_{l \ne m} \left( 1 - \rho_T \rho_S \left\langle \sigma_{i,j}^{(l,m)} \right\rangle \right) \simeq \prod_{l \ne m} \exp\left( -\rho_T \rho_S \left\langle \sigma_{i,j}^{(l,m)} \right\rangle \right), \tag{3.10}$$

which, inserted in Eq. 3.9, yields

$$\langle \pi_{i,j} \rangle_\Omega \simeq 1 - \prod_{l \ne m} \exp\left( -\rho_T \rho_S \left\langle \sigma_{i,j}^{(l,m)} \right\rangle \right) = 1 - \exp\left( -\rho_T \rho_S\, b_{ij} \right), \tag{3.11}$$

where $b_{ij} = \sum_{l \ne m} \langle \sigma_{i,j}^{(l,m)} \rangle$. In the case of the USP and RSP probing, the quantity $b_{ij}$ is by
definition the edge betweenness centrality $\sum_{l \ne m} x_{i,j}^{(l,m)} / \sigma^{(l,m)}$ [123]. For the ASP probing, it is a
closely related quantity. Indeed, if the shortest path is used as the metric defining the optimal path
between pairs of vertices, the betweenness gives a measure of the amount of all-to-all traffic that goes
through an edge or a vertex. We also recall that the betweenness can be considered as a non-local
measure of the centrality of an edge or vertex in the graph (see Section 2.2.4).
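The edge betweenness entering the mean-field discovery probability can be evaluated directly from its definition on small graphs. The sketch below is an illustration only: it uses a brute-force all-pairs computation (not an optimized algorithm), sums over ordered pairs of distinct vertices so that the minimum value for an edge is 2, and the function names are assumptions introduced here.

```python
import math
from collections import deque


def bfs_counts(adj, source):
    """BFS distances and numbers of shortest paths from `source`."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def edge_betweenness(adj):
    """b_ij = sum over ordered pairs (l, m), l != m, of the fraction of
    shortest l-m paths that traverse the edge (i, j). Brute force."""
    info = {s: bfs_counts(adj, s) for s in adj}
    edges = {(i, j) for i in adj for j in adj[i] if i < j}
    b = {e: 0.0 for e in edges}
    for l in adj:
        dist_l, sigma_l = info[l]
        for m in adj:
            if m == l or m not in dist_l:
                continue
            dist_m, sigma_m = info[m]
            for (i, j) in edges:
                # the edge can be crossed as i->j or as j->i, never both
                for a, c in ((i, j), (j, i)):
                    if (a in dist_l and c in dist_m
                            and dist_l[a] + 1 + dist_m[c] == dist_l[m]):
                        b[(i, j)] += sigma_l[a] * sigma_m[c] / sigma_l[m]
    return b


def edge_discovery_probability(b_ij, rho_T, rho_S):
    """Mean-field estimate 1 - exp(-rho_T * rho_S * b_ij)."""
    return 1.0 - math.exp(-rho_T * rho_S * b_ij)
```

For an edge of minimum betweenness 2 and small densities, the estimate reduces to about $2 \rho_T \rho_S$, the probability that its two end vertices are themselves chosen as source and target.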

Since the edge betweenness assumes values between 2 and N(N - 1), the discovery probability of
an edge will therefore depend strongly on its betweenness. For instance, for edges with minimum
betweenness $b_{ij} = 2$, we have $\langle \pi_{i,j} \rangle_\Omega \simeq 2 \rho_T \rho_S$, which recovers the probability that the two end vertices
of the edge are chosen as source and target. This implies that if the densities of sources and targets
are small but finite in the limit of very large N, all the edges of the underlying graph have a finite
probability to be discovered. On the other hand, the discovery probability approaches one for edges
with high betweenness, thus predicting a fair sampling of the network.

In most of the current realistic samplings, the situation is different. While it is reasonable to consider
$\rho_T$ a small but finite value, the number of sources is not extensive ($N_S \sim O(1)$) and their density
tends to zero as $N^{-1}$. In this case it is more convenient to express the edge discovery probability as

$$\langle \pi_{i,j} \rangle_\Omega \simeq 1 - \exp\left( -\epsilon\, \tilde{b}_{ij} \right), \tag{3.12}$$

where $\epsilon = \rho_T N_S$ is the density of probes imposed on the system and the rescaled betweenness $\tilde{b}_{ij} =
N^{-1} b_{ij}$ is now limited to the interval $[2N^{-1}, N - 1]$. In the limit of large networks ($N \to \infty$), it is clear
that edges with low betweenness have $\langle \pi_{i,j} \rangle_\Omega \sim O(N^{-1})$, for any finite value of $\epsilon$. This readily implies
that in real situations the discovery process is generally not complete, a large part of low betweenness
edges not being discovered, and that the network sampling is made progressively more accurate by
increasing the density of probes $\epsilon$.

A similar analysis can be performed for the discovery probability $\pi_i$ of a vertex i. For each
source-target set $\Omega$ we have that

$$\pi_i = 1 - \left( 1 - \sum_{s=1}^{N_S} \delta_{i,i_s} - \sum_{t=1}^{N_T} \delta_{i,j_t} \right) \prod_{l \ne m \ne i} \left( 1 - \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t}\, \sigma_i^{(l,m)} \right), \tag{3.13}$$

where $\sigma_i^{(l,m)} = 1$ if the vertex i belongs to the $\mathcal{M}$-path between nodes l and m, and 0 otherwise. Note
that a vertex belonging to the set of sources and targets is taken to be discovered with
probability one. The second term on the right hand side therefore expresses the fact that the vertex
i does not belong to the set of sources and targets and it is not discovered by any $\mathcal{M}$-path between
source-target pairs. By using the same mean-field approximation as previously, the average vertex
discovery probability reads as

$$\langle \pi_i \rangle_\Omega \simeq 1 - (1 - \rho_S - \rho_T) \prod_{l \ne m \ne i} \left( 1 - \rho_T \rho_S \left\langle \sigma_i^{(l,m)} \right\rangle \right). \tag{3.14}$$



As for the case of the edge discovery probability, the average considers all possible source-target pairs
weighted with probability $\rho_T \rho_S$. In the ASP model, the average $\langle \sigma_i^{(l,m)} \rangle$ is 1 if i belongs to one
of the shortest paths between l and m, and 0 otherwise. For the USP and RSP models, $\langle \sigma_i^{(l,m)} \rangle
= x_i^{(l,m)} / \sigma^{(l,m)}$, where $x_i^{(l,m)}$ is the number of shortest paths between l and m going through i. If
$\rho_T \rho_S \ll 1$, by using the same approximations used for Eq. 3.11 we obtain

$$\langle \pi_i \rangle_\Omega \simeq 1 - (1 - \rho_S - \rho_T) \exp\left( -\rho_T \rho_S\, b_i \right), \tag{3.15}$$

where $b_i = \sum_{l \ne m \ne i} \langle \sigma_i^{(l,m)} \rangle$. For the USP and RSP cases, $b_i = \sum_{l \ne m \ne i} x_i^{(l,m)} / \sigma^{(l,m)}$ is the vertex
betweenness centrality, which is limited to the interval [0, N(N - 1)] [123, 53, 130]. The betweenness
value $b_i = 0$ holds for the leaves of the graph, i.e. vertices with a single edge, for which we recover
$\langle \pi_i \rangle_\Omega \simeq \rho_S + \rho_T$. Indeed, vertices of this kind are dangling ends, which can be discovered only if they
are either sources or targets.

As discussed before, the most usual setup corresponds to a density $\rho_S \sim O(N^{-1})$, and in the large N
limit we can conveniently write

$$\langle \pi_i \rangle_\Omega \simeq 1 - (1 - \rho_T) \exp\left( -\epsilon\, \tilde{b}_i \right), \tag{3.16}$$

where we have neglected terms of order $O(N^{-1})$ and the rescaled betweenness $\tilde{b}_i = N^{-1} b_i$ is now
defined in the interval [0, N - 1]. This expression points out that the probability of vertex discovery
is favored by the deployment of a finite density of targets, which defines its lower bound.
We can also provide a simple approximation for the effective average degree $\langle k_i^* \rangle_\Omega$ of the node i
discovered by our sampling process. Each edge departing from the vertex will contribute proportionally
to its discovery probability, yielding

$$\langle k_i^* \rangle_\Omega \simeq \sum_j \left( 1 - \exp\left( -\epsilon\, \tilde{b}_{ij} \right) \right) \simeq \epsilon \sum_j \tilde{b}_{ij}. \tag{3.17}$$

The final expression is obtained for edges with $\epsilon\, \tilde{b}_{ij} \ll 1$. Since the sum over all neighbors of the edge
betweenness is simply related to the vertex betweenness as $\sum_j b_{ij} = 2(b_i + N - 1)$, where the factor
2 considers that each vertex path traverses two edges and the term N - 1 accounts for all the edge
paths for which the vertex is an endpoint, this finally yields
$$\langle k_i^* \rangle_\Omega \simeq 2\epsilon + 2\epsilon\, \tilde{b}_i. \tag{3.18}$$

           ER        ER        RSF       WEI
N          10^4      10^4      10^4      10^4
E          10^5      5·10^5    22000     55000
⟨k⟩        20        100       4.4       11
k_c        40        140       3500      2000

Table 3.1: Main characteristics of the graphs used in the numerical exploration.



The present analysis shows that the measured quantities and statistical properties of the sampled graph
strongly depend on the parameters of the experimental setup and the topology of the underlying graph.
The latter dependence is expressed through the key role played by edge and vertex betweenness in the
expressions characterizing the graph discovery. The betweenness is a nonlocal topological quantity
whose properties change considerably depending on the kind of graph considered. This allows an
intuitive understanding of the fact that graphs with diverse topological properties respond differently
to sampling experiments.
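The three mean-field estimates obtained in the few-sources regime (edge discovery, vertex discovery and effective discovered degree, as functions of the probing effort and of the rescaled betweenness) are simple enough to be evaluated directly. The helper functions below are a hypothetical illustration; their names and argument order are assumptions made here, and they encode only the approximate expressions of this section, with all their limits of validity.

```python
import math


def edge_discovery(eps, b_tilde_ij):
    """Mean-field edge discovery probability, 1 - exp(-eps * b_tilde_ij)."""
    return 1.0 - math.exp(-eps * b_tilde_ij)


def vertex_discovery(eps, rho_T, b_tilde_i):
    """Mean-field vertex discovery probability,
    1 - (1 - rho_T) * exp(-eps * b_tilde_i); its lower bound is rho_T."""
    return 1.0 - (1.0 - rho_T) * math.exp(-eps * b_tilde_i)


def effective_degree(eps, b_tilde_i):
    """Approximate discovered degree 2*eps + 2*eps*b_tilde_i
    (valid only when eps * b_tilde_ij << 1 for the incident edges)."""
    return 2.0 * eps + 2.0 * eps * b_tilde_i


if __name__ == "__main__":
    # Illustrative values only: probing effort 0.5, target density 0.1.
    for b in (0.0, 1.0, 10.0, 100.0):
        print(b, edge_discovery(0.5, b), vertex_discovery(0.5, 0.1, b))
```

Scanning the rescaled betweenness over several orders of magnitude, as in the small loop above, shows how quickly high-betweenness elements saturate to discovery probability one while low-betweenness ones remain close to their lower bounds.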



3.2.3 Numerical simulations on computer generated networks 

The previous theoretical results have provided some interesting insights into the topological properties
that are responsible for the efficiency and the accuracy of the sampling. In this section, we present the
results of extensive numerical simulations in which the sampling algorithm has been reproduced on
computer generated graphs with different topological properties. In particular, we consider the two
separate classes of homogeneous and heterogeneous networks. We use degree dependent quantities to
monitor the efficiency of the sampling process as a function of the probing effort. The results are then
exploited to understand the properties of the degree distributions of the sampled networks.
Our data report the various measures for three different graphs: the Erdős-Rényi (ER) random graph
as representative of the homogeneous class, and two heterogeneous random graphs, one with a power-law
distribution of the form $P(k) \sim k^{-\gamma}$ (random scale free - RSF), and the other with a Weibull
distribution (WEI) $P(k) = (a/c)(k/c)^{a-1} \exp(-(k/c)^a)$. Both forms have in fact been proposed as
representing the topological properties of the Internet [87]. In both cases, we have generated random
graphs using the configuration model (see Section 2.4.2). The parameter choice is a = 0.25 and
c = 0.6 for the Weibull distribution, and $\gamma = 2.3$ for the RSF case. Two different average degree
values $\langle k \rangle = 20, 100$ have been used for the ER model. In all cases networks have $N = 10^4$ nodes.
The main properties of the various graphs are summarized in Table 3.1.
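As an illustration of how a random scale-free graph might be generated with a configuration-model construction, the sketch below draws a power-law degree sequence and pairs stubs at random. Several choices here are assumptions that the text does not specify: the degree bounds `k_min` and `k_max`, the parity fix on the degree sequence, and the simple treatment of self-loops and multiple edges (they are discarded, so realized degrees can be slightly lower than the drawn ones).

```python
import random


def power_law_degrees(n, gamma, k_min, k_max, rng):
    """Draw n degrees from P(k) ~ k**(-gamma) on k_min..k_max (assumed bounds)."""
    ks = list(range(k_min, k_max + 1))
    weights = [k ** (-gamma) for k in ks]
    degrees = rng.choices(ks, weights=weights, k=n)
    if sum(degrees) % 2 == 1:
        degrees[0] += 1  # the total number of stubs must be even
    return degrees


def configuration_model(degrees, rng):
    """Pair stubs uniformly at random; self-loops and repeated edges are
    dropped (a simplifying assumption of this sketch)."""
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    adj = {v: set() for v in range(len(degrees))}
    for a, b in zip(stubs[0::2], stubs[1::2]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


if __name__ == "__main__":
    rng = random.Random(7)
    degrees = power_law_degrees(10_000, 2.3, 2, 1000, rng)
    graph = configuration_model(degrees, rng)
    print(sum(len(nbrs) for nbrs in graph.values()) / len(graph))
```

A Weibull degree sequence could be plugged into `configuration_model` in the same way by replacing the weights with the corresponding probabilities.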



Efficiency in the numerical sampling of graphs - The first case we consider is that of homogeneous graphs (ER model). As shown in Ref. [?], vertex and edge betweenness are homogeneous quantities and their distributions are peaked around their average values $\langle b \rangle$ and $\langle b_e \rangle$, respectively, spanning only a small range of variation. These typical values can be inserted into Eq. 3.12 and Eq. [?] to estimate the order of magnitude of probes needed for a fair sampling of the graph. Both $\langle \pi_{ij} \rangle_\Omega$ and $\langle \pi_i \rangle_\Omega$ tend to 1 if $\epsilon \gg \max\left( \langle b \rangle^{-1}, \langle b_e \rangle^{-1} \right)$. In this limit all edges and vertices have a probability of being discovered very close to one. At lower values of $\epsilon$, obtained by varying $\rho_T$ and $N_S$, the underlying graph is only partially discovered. Fig. 3.4 shows the behavior of the fraction $N_k^*/N_k$ of discovered vertices of degree $k$, where $N_k$ is the total number of vertices of degree $k$ in the underlying graph, and the fraction of discovered edges $\langle k^* \rangle_\Omega / k$ in vertices of degree $k$. $N_k^*/N_k$ naturally increases with the density of targets and sources, and it increases slightly with $k$. The latter behavior can be easily understood by noticing that vertices with larger degree have on average a larger betweenness. On the other hand, the range of variation of $k$ in homogeneous graphs is very narrow and only a large level of probing may guarantee large discovery probabilities. Similarly, the behavior of the effective discovered degree can be understood by looking at Eq. 3.18. Indeed, the initial decrease of $\langle k^* \rangle_\Omega / k$ is finally compensated by the increase of $\langle b \rangle(k)$. The situation is different in graphs with heavy-tailed connectivity distributions (RSF and WEI models), for which the betweenness spans various orders of magnitude and the fraction of vertices with very high betweenness is not negligible. In such a situation, even in the case of small $\epsilon$, vertices whose betweenness is large enough ($b_i \epsilon \gg 1$) have $\langle \pi_i \rangle_\Omega \simeq 1$. Therefore all vertices with degree $k \gg \epsilon^{-1/\beta}$ will be detected with probability one. This is clearly visible in Fig. 3.4, where the discovery probability $N_k^*/N_k$ of vertices with degree $k$ saturates to one for large degree values. Consistently, the degree value at which the curve saturates decreases with increasing $\epsilon$. A similar effect occurs in the measurements concerning $\langle k^* \rangle_\Omega / k$. After an initial decay (Fig. 3.4), the effective discovered degree increases with the degree of the vertices. This qualitative feature is captured by Eq. 3.18, which gives $\langle k^* \rangle_\Omega / k \simeq \epsilon k^{-1} \left( 1 + \langle b \rangle(k) \right)$. At large $k$ the term $k^{-1} \langle b \rangle(k) \sim k^{\beta - 1}$ takes over and the effective discovered degree approaches the real degree $k$. Moreover, the broader the distribution of betweenness or connectivity, the better the sampling obtained.

Figure 3.4: Frequency $N_k^*/N_k$ of detecting a vertex of degree $k$ (left) and proportion of discovered edges $\langle k^* \rangle_\Omega / k$ (right) as a function of the degree in the RSF, WEI, and ER graph models. The exploration setup considers $N_S = 5$ and increasing probing level $\epsilon$ obtained by progressively higher density of targets $\rho_T$. The axis of ordinates is in log scale.
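The saturation argument above can be made concrete with a small numerical sketch. It assumes, as the text does, that the average betweenness scales as a power of the degree with exponent $\beta$, and it sets the unknown prefactor to one for illustration. The function name `saturation_degree` is hypothetical and not part of the original work.

```python
def saturation_degree(epsilon: float, beta: float) -> float:
    """Degree scale k_c = epsilon**(-1/beta).

    Assumption (for illustration only): b(k) = k**beta with unit prefactor,
    so that b(k) * epsilon >> 1 is equivalent to k >> epsilon**(-1/beta).
    """
    if epsilon <= 0 or beta <= 0:
        raise ValueError("epsilon and beta must be positive")
    return epsilon ** (-1.0 / beta)


# The saturation scale moves to lower degrees as the probing level grows.
for eps in (0.01, 0.1, 1.0):
    print(f"epsilon = {eps:<5} k_c = {saturation_degree(eps, beta=2.0):.2f}")
```

Only the trend matters here: a larger $\epsilon$ gives a smaller degree scale, in line with the statement that the saturation point decreases with increasing probing level.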

Degree distributions - We now have a clearer picture of the relation between the exploration process and the underlying graph, and we can tackle the important issue of determining the origin of sampling biases in the observed degree distributions. Fig. 3.5 shows the cumulative degree distribution $P_c(k) = \sum_{k' \geq k} P(k')$ of the sampled graph defined by the ER model for increasing density of targets and sources. Sampled distributions look only approximately like the genuine distribution; however, for $N_S \geq 2$ they are far from true heavy-tail distributions at any appreciable level of probing. Indeed, the distribution generally runs over a small range of degrees, with a cut-off that sets in at the average degree $\langle k \rangle$ of the underlying graph. In order to stretch the distribution range, homogeneous graphs with very large average degree $\langle k \rangle$ must be considered, as also emerges from the rigorous proof provided in Ref. [?] for single-source explorations.

Figure 3.5: Cumulative degree distribution of the sampled ER graph for USP probes. Figures (A) and (B) correspond to $\langle k \rangle = 20$, and (C) and (D) to $\langle k \rangle = 100$. Figures (A) and (C) show sampled distributions obtained with $N_S = 2$ and varying target density $\rho_T$. In the insets we report the peculiar case $N_S = 1$ that provides an apparent power-law behavior with exponent $-1$ at all values of $\rho_T$, with a cut-off depending on $\langle k \rangle$. The insets are in lin-log scale to show the logarithmic behavior of the corresponding cumulative distribution. Figures (B) and (D) correspond to $\rho_T = 0.1$ and varying number of sources $N_S$. The solid lines are the degree distributions of the underlying graph. For $\langle k \rangle = 100$, the sampled cumulative distributions display plateaus corresponding to peaks in the degree distributions, induced by the sampling process.
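As a minimal sketch of how the cumulative degree distribution of a sampled graph can be computed, the following assumes the convention that $P_c(k)$ sums $P(k')$ over all $k' \geq k$ and that the degrees of the discovered vertices are available as a plain list. The function name is hypothetical.

```python
from collections import Counter


def cumulative_degree_distribution(degrees):
    """Return {k: P_c(k)} where P_c(k) is the fraction of vertices with degree >= k."""
    n = len(degrees)
    if n == 0:
        raise ValueError("empty degree sequence")
    counts = Counter(degrees)
    cumulative = {}
    running = 0
    # Accumulate from the largest degree downwards so that each entry
    # contains the number of vertices with degree >= k.
    for k in sorted(counts, reverse=True):
        running += counts[k]
        cumulative[k] = running / n
    return dict(sorted(cumulative.items()))


example = cumulative_degree_distribution([1, 1, 2, 3])
print(example)
```

Plotting such a dictionary in log-log or lin-log scale is the usual way to compare a sampled distribution with that of the underlying graph.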

However, other distinctive spurious effects appear in this case. In particular, since the best sampling occurs around the high degree values, the distributions develop peaks that appear as plateaus in the cumulative distribution (see Fig. 3.5). The insets of Fig. 3.5 display the single-source case, in which we recover the apparent scale-free behavior with slope $-1$. It is worth noting that the experimental setup with a single source corresponds to a highly asymmetric probing process, in which the mean-field approach, and consequently our theoretical predictions, are not valid.

The present analysis shows that in order to obtain a sampled graph with apparent scale-free behavior on a degree range varying over $n$ orders of magnitude we would need the very peculiar sampling of a homogeneous underlying graph with an average degree $\langle k \rangle \simeq 10^n$, which is a rather unrealistic situation in the Internet and many other information systems, where the observed cut-off sets in at least at $k \sim O(10^2)$ (i.e. $n \geq 2$). Indeed, it would mean that on average the Internet's autonomous systems should have at least $O(10^2)$ connections, which is a completely unrealistic number. On the contrary, with the RSP and ASP probing models, we observe that the obtained distributions are closer to the real one, almost independently of the probing effort. On graphs with heavy-tailed distributions (see Fig. 3.6), we observe completely different results, due to the fact that the distribution tail is fairly well reproduced even at rather small values of $\epsilon$. Although both underlying graphs (WEI and RSF) have a small average degree, the degree distribution spans more than two orders of magnitude, and the whole range is sufficiently well sampled by the exploration process.

Figure 3.6: Cumulative degree distributions of the sampled RSF and WEI graphs for USP probes. The top figures show sampled distributions obtained with $N_S = 2$ and varying target density $\rho_T$. The figures on the bottom correspond to $\rho_T = 0.1$ and varying number of sources $N_S$. The solid lines are the degree distributions of the underlying graph.

Some distortions occur for low and average degree nodes, which are under-sampled. This undersampling can either yield an apparent change in the exponent of the degree distribution (as also noticed by Petermann and De Los Rios in Ref. [205] for single-source experiments), or, if $N_S$ is small, yield a power-law-like distribution for an underlying Weibull distribution. As shown in Fig. 3.6, a small increase in the number of sources makes it possible to discriminate between the two forms even at small $\rho_T$.
The disparity in the quality of the results for homogeneous and heterogeneous networks is due to the different discovery efficiency reached by the process. In heterogeneous graphs, vertices with high degree are efficiently sampled with an effective measured degree that is rather close to the real one. This means that the degree distribution tail is fairly well sampled while deviations should be expected at lower degree values. In conclusion, graphs with heavy-tailed degree distribution allow a better qualitative representation of their statistical features in sampling experiments. Indeed, the most important properties of these graphs are related to the heavy-tail part of the statistical distributions, which is well discriminated by the traceroute-like exploration.



3.2.4 Accuracy of the mapping process 

Up to now, we have focused on the efficiency of the sampling process, indicating the density of probes that must be deployed throughout the network in order to get reasonably good maps of the network. Another important aspect is the level of accuracy that the exploration is able to achieve in the description of the local topology. The most common biases affecting the mapping process concern 1) the missing lateral connectivity, and 2) the multiple sampling of central nodes (and edges), which may affect the efficiency of the whole process.

While the first problem might be solved by optimizing the deployment of probes, relying on a criterion of decentralization of sources and targets, multiple sampling can be studied through some general concepts like redundancy and dissymmetry of the discovery process. A sampling is redundant when nodes (edges) are discovered many times during the traceroute; it is locally symmetric when the neighborhood of the nodes is equally sampled, i.e. there are no preferential paths by which a node is traversed. In the following, we give quantitative measures of the level of redundancy and dissymmetry of a traceroute mapping process, revealing their relation with the topology of the sampled graph.

Redundancy - On the one hand, the node discovery process requires a certain level of redundancy, since each new passage might, in principle, contribute to a more detailed exploration of the neighborhood. When, however, the discovery frequency is too large, it can seriously affect the efficiency of the whole process. Let us define the edge redundancy $r_e(i,j)$ of an edge $(i,j)$ in a traceroute-sampling as the number of probes passing through the edge $(i,j)$. Using the notations of Section 3.2.2, this quantity is written for a given set of probes and targets as

$$ r_e(i,j) = \sum_{l \neq m} \left( \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t} \, \sigma_{ij}^{(l,m)} \right) . \tag{3.19} $$

Averaging over all possible realizations and assuming the uncorrelation hypothesis, we obtain

$$ \langle r_e(i,j) \rangle_\Omega \simeq \sum_{l \neq m} \rho_T \rho_S \langle \sigma_{ij}^{(l,m)} \rangle_\Omega = \rho_T \rho_S \, b_{ij} . \tag{3.20} $$

This result implies that the average redundancy of an edge is related to the density of sources and targets, but also to the edge betweenness. For example, an edge of minimum betweenness $b_{ij} = 2$ can be discovered at most twice in the extreme limit of an all-to-all probing. On the contrary, a very central edge of betweenness $b_{ij}$ close to the maximum $N(N-1)$ would be discovered approximately $O(N)$ times by a traceroute-probing from a single source to all the possible destinations.
Similarly, the redundancy $r_n(i)$ of a node $i$, intended as the number of times the probes cross the node $i$, can be obtained:

$$ r_n(i) = \sum_{l \neq m} \sum_{s=1}^{N_S} \sum_{t=1}^{N_T} \delta_{l,i_s} \, \delta_{m,j_t} \, \sigma_{i}^{(l,m)} . \tag{3.21} $$

After separating the cases $l = i$ and $m = i$ in the sum, the averaging over the positions of sources and targets yields in the mean-field approximation:

$$ \langle r_n(i) \rangle_\Omega \simeq \sum_{l \neq m \neq i} \rho_S \rho_T \langle \sigma_{i}^{(l,m)} \rangle_\Omega + 2 \rho_S \rho_T N \simeq 2\epsilon + \rho_S \rho_T \, b_i . \tag{3.22} $$

In this case, a term related to the number of traceroute probes $\epsilon$ appears, showing that unavoidably a part of the mapping effort goes to generate node redundancy.¹ In Fig. 3.7 we report the behavior of the average node redundancy as a function of the degree $k$ for both homogeneous and heterogeneous graphs. For both models, the behaviors are in good agreement with the mean-field prediction, showing the tight relation between redundancy and betweenness centrality.

¹ By simple manipulation of Eqs. 3.20 and 3.22, an equivalent of the identity $\sum_j b_{ij} = 2(b_i + N - 1)$ for redundancies is $\sum_j \langle r_e(i,j) \rangle_\Omega \simeq 2 \langle r_n(i) \rangle_\Omega - 2\epsilon$.

Figure 3.7: Average node redundancy as a function of the degree $k$ for RSF (top) and ER (bottom) models ($N = 10^4$). For the ER model, two blocks of data are plotted, for $\langle k \rangle = 20$ (left) and for $\langle k \rangle = 100$ (right). The target density is fixed ($\rho_T = 0.1$), and $N_S = 2$ (circles), 10 (squares), 20 (triangles). The dashed lines represent the analytical prediction $2\epsilon + \rho_S \rho_T \langle b \rangle(k)$, in perfect agreement with the simulations.
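The definitions of edge and node redundancy can be illustrated with a small simulation. The sketch below is not the code used for the figures: it assumes an undirected graph stored as an adjacency dictionary, selects a single shortest path per source-target pair by breadth-first search (a stand-in for the unique-shortest-path probe), and counts the endpoints of each path among the visited nodes. All function names are hypothetical.

```python
from collections import Counter, deque


def shortest_path(adj, source, target):
    """Return one shortest path from source to target (BFS), or [] if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if target not in parent:
        return []
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def redundancies(adj, sources, targets):
    """Count how many probes traverse each edge and each node."""
    edge_red = Counter()
    node_red = Counter()
    for s in sources:
        for t in targets:
            if s == t:
                continue
            path = shortest_path(adj, s, t)
            node_red.update(path)  # endpoints are counted as visits too
            for u, v in zip(path, path[1:]):
                edge_red[frozenset((u, v))] += 1
    return edge_red, node_red


# Star graph: hub 0 with four leaves; one source probing three targets.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
edge_red, node_red = redundancies(star, sources=[1], targets=[2, 3, 4])
print(dict(node_red))
```

In this toy example every probe crosses the hub, which mirrors on a small scale the concentration of redundancy on central nodes discussed in the text.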

In the case of heavy-tailed underlying networks, the node redundancy typically grows as a power law of the degree, while the values for random graphs vary on a smaller scale. This behavior shows that the intrinsic hierarchical structure of scale-free networks plays a fundamental role even in the process of path routing, resulting in a huge number of probes repeatedly passing through the same small set of hubs. On the other hand, for homogeneous graphs the total number of node visits is quite uniformly distributed over the whole range of connectivity, independently of the relative importance of the nodes. This is a further element in favor of the argument that homogeneous graphs with large mean connectivity are badly sampled. Indeed, the local topology of well-connected vertices is analyzed with the same level of accuracy as that of low-degree nodes, yielding generally unsatisfactory results.

Dissymmetry: Participation Ratio - The high rate of redundancy found in the numerical data does not necessarily imply that the local topology close to a node is well discovered: preferential paths could indeed carry most of the probing effort. Let us consider the relative number of occurrences of a given edge $(i,j)$ during the traceroute, with respect to the total number of occurrences of the edges in the neighborhood of $i$. For each discovered node $i$, we can thus define a set of frequencies $\{ f_j^{(i)} \}_{j \in \mathcal{V}(i)}$ for the edges of its neighborhood; in terms of redundancy, the edge frequency $f_j^{(i)}$ is defined by

$$ f_j^{(i)} = \frac{r_e(i,j)}{\sum_{j' \in \mathcal{V}(i)} r_e(i,j')} , \qquad 0 \leq f_j^{(i)} \leq 1 \quad \forall (i,j) \in \mathcal{E} . \tag{3.23} $$

Neglecting the correlations, we can write an approximation for the average edge frequency as

$$ \langle f_j^{(i)} \rangle_\Omega = \left\langle \frac{r_e(i,j)}{\sum_{j'} r_e(i,j')} \right\rangle_\Omega \simeq \frac{\langle r_e(i,j) \rangle_\Omega}{\sum_{j'} \langle r_e(i,j') \rangle_\Omega} \simeq \frac{\rho_S \rho_T \, b_{ij}}{2 \rho_S \rho_T (b_i + N - 1)} = \frac{b_{ij}}{2 (b_i + N - 1)} , \tag{3.24} $$

where we have used the identity $\sum_j b_{ij} = 2(b_i + N - 1)$. The calculation reveals that, to a first approximation, the edge frequencies are topological quantities, independent of the probing effort, in agreement with the fact that frequencies are relative quantities. The dissymmetry of the discovery of the neighborhood of a node may be quantified through the participation ratio of these frequencies

$$ Y_2(i) = \sum_{j \in \mathcal{V}(i)} \left( f_j^{(i)} \right)^2 . $$

Figure 3.8: Participation ratio as a function of the discovered connectivity $k^*$ for RSF (top) and ER (bottom) models ($N = 10^4$). The target density is fixed ($\rho_T = 0.1$) and three values of $N_S$ are presented: 2 (circles), 10 (squares), 20 (triangles). The dashed lines correspond to the $1/k^*$ bound.

If all the edge frequencies of $i$ are of the same order $\sim 1/k^*$ (only discovered links give a finite contribution), the participation ratio should decrease as $1/k^*$ with increasing discovered connectivity $k^*$. Hence, in the limit of an optimally symmetric sampling, it should yield a power-law behavior $Y_2(k^*) \sim (k^*)^{-1}$. When only a few links are preferred, for instance because they are more central in the shortest path routing, the sum is dominated by these terms, leading to a value closer to the upper bound 1. Numerical data for $Y_2$ as a function of the discovered connectivity $k^*$ for different probing efforts are displayed in Fig. 3.8. For heterogeneous graphs, the values of $Y_2$ tend towards the curve $(k^*)^{-1}$ for increasing $\epsilon$. The average local topology of low degree nodes seems to be sampled more homogeneously than that of the larger degree nodes. On the contrary, in the homogeneous case (ER), the figures show a generally high level of dissymmetry that persists at all degree values and depends only slightly on the actual connectivity.
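The two limiting behaviors of the participation ratio can be checked with a short sketch. It assumes the standard definition of the participation ratio as the sum of squared frequencies, with the frequencies obtained by normalizing the edge redundancies around one node. The function name is hypothetical.

```python
def participation_ratio(edge_redundancies):
    """Sum of squared edge frequencies for the discovered edges of one node.

    edge_redundancies: number of probe passages through each discovered edge
    incident to the node (undiscovered edges are simply left out).
    """
    total = sum(edge_redundancies)
    if total == 0:
        raise ValueError("the node has no discovered edge")
    return sum((r / total) ** 2 for r in edge_redundancies)


# Perfectly symmetric sampling of k* = 4 edges: the result equals 1/k*.
print(participation_ratio([5, 5, 5, 5]))
# One preferred edge carries almost all probes: the result is close to 1.
print(participation_ratio([97, 1, 1, 1]))
```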

Dissymmetry: Entropy Measure - The edge frequency is influenced by the presence of sources and targets; thus we introduce a more refined frequency, $f_{jk}^{(i)}$, defined as the number of probes passing through the pair $(k,i)$-$(i,j)$ of edges centered on the node $i$, relative to the total number of transits through any of the possible pairs of edges in the neighborhood of $i$. This frequency does not take into account single edges, but the paths traversing each vertex and the dissymmetry of the flow.

A simple qualitative estimate for the average frequency is obtained using the usual first approximation for the edge redundancy:²

$$ \langle f_{jk}^{(i)} \rangle_\Omega \simeq \frac{\rho_S \rho_T \, b_{ki} \, b_{ij} \, (1 - c_i)}{2 (b_i + N - 1)} \, \frac{1}{\rho_S \rho_T \, b_i} . \tag{3.26} $$

Even in this more complex situation, the approximate expression for the frequencies depends only on topological properties of the underlying graph (such as the betweenness centrality and the clustering coefficient $c_i$).

By means of this frequency, we define an entropy measure providing supplementary evidence of the tight relation between local accuracy, homogeneous sampling and topological characterization of graphs. Indeed, a traceroute that discovers nodes by crossing a larger variety of their links, and along different paths, is expected to be more accurate (and likely more efficient) than one always selecting the same path.

In the same spirit as the Shannon entropy [41], which is a good indicator of homogeneity, we define the local traceroute entropy of a node $i$ by

$$ H_i = - \frac{1}{\log k_i^*} \sum_{j \neq k \in \mathcal{V}(i)} f_{jk}^{(i)} \log f_{jk}^{(i)} , $$

where $\log k_i^*$ is simply a normalization factor. As usual, we define $H(k)$ as the entropy averaged over the nodes of degree $k$. The numerical data of $H(k)$ for RSF and ER models and for different levels of probing are reported in Fig. 3.9. The values for ER increase slightly both with increasing degree $k$ and with the number of sources $N_S$, with no qualitative difference in the behavior at low or high degree regions. On the other hand, the case of heterogeneous networks agrees with the previous observations. The curve for $H(k)$, indeed, saturates to values very close to the maximum 1 at large enough degree, indicating a very homogeneous sampling of these nodes.

² In this case the redundancy appearing in the fraction defining the frequency $f_{jk}^{(i)}$ is the contribution $\rho_S \rho_T \sum_{l \neq m \neq i} \langle \sigma_{ki}^{(l,m)} \sigma_{ij}^{(l,m)} \rangle_\Omega$ of the edge pair $(k,i)$-$(i,j)$. Considering separately the portion of shortest path from $l$ to $i$ through $k$ and from $i$ to $m$ through $j$, we replace the sum of average products with the approximation $(1 - c_i) \, b_{ki}$ times an averaged term over the paths leaving $i$ through $j$. Up to a factor, this last averaged term is the redundancy of the edge $(i,j)$. Since the sum of edge-pair redundancies over the neighborhood gives essentially the redundancy of the central node $i$, the final expression follows easily.

Figure 3.9: Entropy vs. $k$: a saturation effect is clear at medium-high degree nodes for scale-free topologies (RSF), instead of a more regular increase for homogeneous graphs (ER). In the figure there are different curves for $N_S = 2$ (circles), 10 (squares), 20 (triangles) and $\rho_T = 0.1$.
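A minimal sketch of the entropy measure follows. It assumes that the local traceroute entropy is the Shannon entropy of the edge-pair frequencies divided by the logarithm of the discovered degree, which is how the normalization is described in the text; the exact form used for the figures may differ. The input is the list of transit counts through each pair of edges centered on the node, and the function name is hypothetical.

```python
import math


def local_traceroute_entropy(pair_counts, discovered_degree):
    """Shannon entropy of edge-pair frequencies, divided by log(k*).

    pair_counts: number of probes transiting through each pair of edges
    centered on the node. discovered_degree: k*, assumed to be >= 2.
    """
    if discovered_degree < 2:
        raise ValueError("normalization by log(k*) requires k* >= 2")
    total = sum(pair_counts)
    if total == 0:
        raise ValueError("no transit through the node")
    entropy = 0.0
    for count in pair_counts:
        if count > 0:  # pairs never used contribute nothing
            f = count / total
            entropy -= f * math.log(f)
    return entropy / math.log(discovered_degree)


# All probes follow the same pair of edges: no dispersion at all.
print(local_traceroute_entropy([10], discovered_degree=3))
# Two pairs used equally often around a node with k* = 2.
print(local_traceroute_entropy([1, 1], discovered_degree=2))
```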

Summarizing, in the case of heterogeneous networks, the nodes with high degree and betweenness are in general redundantly sampled, but present a rather symmetrical discovery of their neighborhood. On the contrary, in homogeneous networks vertices undergo a less redundant sampling, but with a higher dissymmetry of the local exploration process. This result should be taken into account when deciding source-target deployment strategies, in order to minimize both dissymmetry and redundancy.

3.2.5 Optimization 

In the previous sections we have provided a general qualitative understanding of the efficiency of traceroute-like exploration and the induced biases on the statistical properties. The quantitative analysis of the sampling strategies, however, is a much harder task that calls for a detailed study of the discovered proportion of the underlying graph and the precise deployment of sources and targets. In this perspective, Guillaume and Latapy have shown in Ref. [139] that the fractions $N^*/N$ and $E^*/E$ of vertices and edges discovered in the sampled graph depend on the probing effort. Unfortunately, the mean-field approximation breaks down when we aim at a quantitative representation of the results, since the neglected correlations are necessary for a precise estimate of the various quantities of interest. For this reason we performed an exhaustive set of numerical explorations aimed at a fine determination of the level of sampling achieved for different experimental setups.

In Fig. 3.10 we report the proportion of discovered edges in the numerical exploration of homogeneous (ER model) and heterogeneous (RSF and WEI models) graphs for increasing level of probing effort $\epsilon$. The level of probing is increased either by raising the number of sources at fixed target density or by raising the target density at fixed number of sources. As expected, both strategies are progressively more efficient with increasing levels of probing.

Figure 3.10: Behavior of the fraction of discovered edges in explorations with increasing $\epsilon$. For each underlying graph studied we report two curves corresponding to larger $\epsilon$ achieved by increasing the target density $\rho_T$ at constant $N_S = 5$ (squares) or the number of sources $N_S$ at constant $\rho_T = 0.1$ (circles).
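The measurement of the discovered fractions can be sketched on a toy graph. The code below is an illustration under simplifying assumptions, not the setup used for the figures: the graph is an adjacency dictionary, each source builds one breadth-first tree, and the path to every target is read off that tree. The function names and the ring example are hypothetical.

```python
from collections import deque


def bfs_parents(adj, source):
    """Parent pointers of one breadth-first shortest-path tree rooted at source."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def discovered_fractions(adj, sources, targets):
    """Return (N*/N, E*/E) for the union of the source-to-target paths."""
    nodes_seen, edges_seen = set(), set()
    for s in sources:
        parent = bfs_parents(adj, s)
        for t in targets:
            if t == s or t not in parent:
                continue
            node = t
            # Walk back from the target to the source along the tree.
            while parent[node] is not None:
                nodes_seen.add(node)
                edges_seen.add(frozenset((node, parent[node])))
                node = parent[node]
            nodes_seen.add(node)  # the source itself
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return len(nodes_seen) / len(adj), len(edges_seen) / n_edges


# Ring of five nodes: a single source can never see the edge opposite to it.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
print(discovered_fractions(ring, sources=[0], targets=[1, 2, 3, 4]))
print(discovered_fractions(ring, sources=[0, 2], targets=[1, 2, 3, 4]))
```

With one source the union of paths is a tree and one edge of the ring stays hidden; adding a second source reveals it, which is the elementary mechanism behind the gain obtained by raising the number of sources.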

In heterogeneous graphs, it is also possible to see that when the number of sources is $N_S \sim O(1)$, increasing the number of targets achieves better sampling than increasing the number of deployed sources. On the other hand, our model of shortest path exploration is symmetric under the exchange of sources and targets; therefore in the numerical experiments we can easily verify that if the number of sources is very large and $\rho_T \sim O(1/N)$, then increasing the number of sources achieves better sampling than increasing the number of deployed targets.

This finding hints toward a behavior that is determined by the number of sources and targets, $N_S$ and $N_T$ (or equivalently by $N_S$ and $\rho_T$). This point is clearly illustrated in Fig. 3.11, where we report the behavior of $E^*/E$ and $N^*/N$ at fixed $\epsilon$ and varying $N_S$ and $\rho_T$. The curves exhibit a non-trivial behavior and, since we work at fixed $\epsilon = \rho_T N_S$, any measured quantity can then be written as $f(\rho_T, \epsilon/\rho_T) = g_\epsilon(\rho_T)$. It is worth noting that the curves show a structure allowing for local minima and maxima in the discovered portion of the underlying graph.

This feature is a consequence of the symmetry of the model under the exchange of sources and targets, i.e. an exploration with $(N_T, N_S) = (N_1, N_2)$ is equivalent to one with $(N_T, N_S) = (N_2, N_1)$. In other words, at fixed $\epsilon = N_1 N_2 / N$, a density of targets $\rho_T = N_1/N$ is equivalent to a density $\rho'_T = N_2/N$. Since $N_2 = \epsilon/\rho_T$, we get that at constant $\epsilon$, experiments with $\rho_T$ and $\rho'_T = \epsilon/(N \rho_T)$ are equivalent, obtaining by symmetry that any measured quantity obeys the equality

$$ g_\epsilon(\rho_T) = g_\epsilon\!\left( \frac{\epsilon}{N \rho_T} \right) . \tag{3.27} $$

Figure 3.11: Behavior as a function of $\rho_T$ of the fraction of discovered edges and nodes in explorations with fixed $\epsilon$ (here $\epsilon = 2$). Since $\epsilon = \rho_T N_S$, the increase of $\rho_T$ corresponds to a lowering of the number of sources $N_S$.



This relation implies a symmetry point signaling the presence of a maximum or a minimum at $\rho_T = \epsilon/(N \rho_T)$. We therefore expect the occurrence of a symmetry in the graphs of Fig. 3.11 at $\rho_T = \sqrt{\epsilon/N}$. Indeed, the symmetry point is clearly visible and in good quantitative agreement with the previous estimate in the case of heterogeneous graphs. For homogeneous topology the curves have a smooth behavior that makes the clear identification of the symmetry point difficult. Moreover, USP probes create a certain level of correlations in the exploration that tends to hide the complete symmetry of the curves. The previous results imply that at fixed levels of probing $\epsilon$ different proportions of sources and targets may achieve different levels of sampling. This suggests a search for optimal strategies in the relative deployment of sources and targets. The picture, however, is more complicated if we look at other quantities in the sampled graph. In Fig. 3.12 we show the behavior at fixed $\epsilon$ of the measured average degree $\langle k \rangle^*$ normalized by the actual average degree $\langle k \rangle$ of the underlying graph as a function of $\rho_T$. The plot shows also in this case a symmetric structure. By comparing Fig. 3.12 with Fig. 3.11 we notice that the symmetry point is of a different nature for different quantities: the minimum in the fraction of discovered edges corresponds to the best estimate of the average degree. A similar result is obtained for the behavior of the ratio $\langle c \rangle^* / \langle c \rangle$ between the clustering coefficient of the sampled and the underlying graph: as shown in Fig. 3.12, the best level of sampling is achieved at particular values of $\epsilon$ and $N_S$ that conflict with the best sampling of other quantities.
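The source-target exchange symmetry is easy to verify numerically. The sketch below only encodes the relations stated in the text (the equivalent density $\epsilon/(N\rho_T)$ and the symmetry point $\sqrt{\epsilon/N}$); the function names and the value $N = 10^4$ used in the example are assumptions made for illustration.

```python
import math


def equivalent_target_density(rho_t, epsilon, n_nodes):
    """Target density obtained by exchanging sources and targets at fixed epsilon."""
    return epsilon / (n_nodes * rho_t)


def symmetry_point(epsilon, n_nodes):
    """Target density that is mapped onto itself by the exchange."""
    return math.sqrt(epsilon / n_nodes)


N = 10_000        # assumed graph size, for illustration only
EPSILON = 2.0     # probing level quoted in the caption of Fig. 3.11

rho_star = symmetry_point(EPSILON, N)
swapped = equivalent_target_density(0.1, EPSILON, N)
print(f"symmetry point rho_T = {rho_star:.5f}")
print(f"rho_T = 0.1 is equivalent to rho_T' = {swapped:.5f}")
```

With these numbers, $\rho_T = 0.1$ corresponds to 1000 targets and 20 sources, and the exchanged setup has 20 targets and 1000 sources, i.e. a target density of 0.002.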

The numerical data obtained with different source and target densities hint at a possible optimization of the sampling strategy. The optimal solution, however, appears as a trade-off between the different levels of efficiency achieved in competing ranges of the experimental setup. In this respect, a detailed and quantitative investigation of the various quantities of interest in different experimental setups is needed to pinpoint the most efficient deployment of source-target pairs depending on the underlying graph topology.

Figure 3.12: Behavior as a function of $\rho_T$ of the normalized average degree $\langle k \rangle^* / \langle k \rangle$ and of the normalized average clustering coefficient $\langle c \rangle^* / \langle c \rangle$ for a fixed probing level $\epsilon$ (here $\epsilon = 2$).



3.2.6 Non-local measures under sampling: the case of k-core structures 

Up to now, all statistical quantities studied on sampled networks have been related to local properties and local correlations; but real networks may present non-locally correlated structures, whose integrity under sampling is even more questionable. For this reason, we have investigated the effects of traceroute-like sampling on the $k$-core organization of networks.

The $k$-core analysis of a network is based on a non-trivial decomposition into subgraphs, which has recently attracted the interest of physicists working in this field for its relation with box counting methods in the study of self-similarity of natural systems [225, 226, 132].

The $k$-core of a network, defined in Section 2.2.5, is the maximal subgraph induced by all nodes having at least $k$ neighbors in it, and the $k$-shell is the set of nodes belonging to the $k$-core but not to the $(k+1)$-core.
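The definition above translates directly into a peeling procedure: nodes of lowest degree are removed repeatedly, and the value of $k$ at which a node disappears is its shell index. The following is a simple sketch of that idea for an undirected graph stored as an adjacency dictionary; it is not the implementation used in the original work, and the function name is hypothetical.

```python
def shell_indices(adj):
    """Return {node: shell index} by iteratively peeling low-degree nodes."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    while remaining:
        # Smallest residual degree among the surviving nodes.
        k = min(degree[u] for u in remaining)
        stack = [u for u in remaining if degree[u] <= k]
        while stack:
            u = stack.pop()
            if u not in remaining:
                continue  # already removed through another neighbor
            remaining.discard(u)
            shell[u] = k
            for v in adj[u]:
                if v in remaining:
                    degree[v] -= 1
                    if degree[v] <= k:
                        stack.append(v)
    return shell


# Triangle 0-1-2 with a pendant node 3 attached to node 0.
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
shells = shell_indices(graph)
print(shells)
```

In the example the pendant node forms the 1-shell, while the triangle survives as the 2-core, so the maximum shell index is 2.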

The $k$-core decomposition of a network, going from $k = 1$ (i.e. the whole network without isolated nodes) up to the maximum available value $k_{max}$, provides a hierarchical structure in which the innermost $k$-shells contain high degree nodes belonging to the fundamental backbone of the network, whereas the external ones are formed by low-degree and more peripheral nodes. In Ref. [?] (see also the web-site of the visualization tool Lanet-VI [?]), we have used the $k$-core decomposition as a tool for analyzing networks, discovering hierarchical structures, and studying the main statistical properties at the different scales (i.e. in the different $k$-cores).

Almost all complex networks (real and synthetic) seem to share a common "scale-invariance" property with respect to the $k$-core organization: indeed, after a simple rescaling, the curves corresponding to quantities like the degree distribution $P(k)$, the average nearest neighbor degree $k_{nn}(k)$ and the clustering coefficient $c(k)$, computed in the different $k$-cores, show a nice data collapse. It is thus interesting to test the behavior of the $k$-core decomposition in sampled networks, in order to check the robustness of these properties with respect to possible sampling biases.³

Note that a single-source traceroute-like probing yields essentially a tree, so that the $k$-core decomposition is by definition trivial (with maximum core $k_{max} = 1$). More generally, a sampling cannot discover paths or edges that do not exist, so that the maximal shell index of a network, $k_{max}$, is not increased by partial sampling (like the maximal degree observed), and conversely, the actual $k_{max}$ is at least equal to the one found by a sampling of the true network.

Internal cores are more connected and are traversed by a larger number of paths; therefore we expect that a path-based sampling should discover and sample the central cores better, introducing stronger biases in the structure of the peripheral shells. Moreover, the shell index of a node is directly related to its routing capacity, since two nodes belonging to the same shell of index $k$ have exactly $k$ "distinct" paths between them, where distinct means that no node and no edge are used more than once. The abundance of paths between nodes corresponds also to a higher level of structural and functional robustness of the system. Hence, nodes with high shell index are expected to perform better in routing processes.

We have checked such ideas by performing a traceroute-like probing of various networks, and comparing their $k$-core decomposition before and after sampling. We have used $N_S = 50$ sources, and various values of the probing effort from $\epsilon = 0.1$ to $\epsilon = 5$.

Figure 3.13 reports the curves of the $k$-shell size as a function of the index for various network models and various sampling efforts. The numerical measures have been performed on four types of networks: the ER and RSF models, and two networks obtained using the generators BRITE [?] and INET [?], which are based on optimization strategies and are commonly used by computer scientists in order to reproduce some specific features of the Internet, such as power laws and hierarchy. Both yield broad degree distributions and general properties similar to RSF graphs. For ER networks, shells are almost uniformly populated and concentrated in a range of index values around $\langle k \rangle$. Such networks, whose $k$-core structure is very different from that observed for AS maps (see Ref. [?]), show a rather peculiar behavior also after sampling.

On the contrary, the power-law shapes obtained for RSF or BRITE networks are comparable to the one observed in the AS maps and look very robust under sampling, even if the slope is affected. Indeed, shells of smaller indices are less well sampled. In particular, the size of the first shell is strongly decreased by the sampling procedure; in some cases, in fact, the first shell is larger than the second in the original network, but becomes smaller in the sampled network. We note that in the available AS maps, the first shell is indeed typically smaller than the second, and that the true AS network thus very probably exhibits a much larger shell of index $k = 1$. This is consistent with the idea that the proportion of leaves is extremely underestimated in current Internet mapping data. A similar argument should hold for the value of the exponent in the power-law behavior of the shell size vs. its index (see Ref. [?]).

Figure 3.14 reports the behavior of other typical quantities of the network: the average degree of the nearest neighbors of a node of degree $k$, and the clustering coefficient of nodes of degree $k$. These properties show self-similar features in the $k$-core decomposition that seem to be preserved by the sampling process. Although the precise form of the degree distribution of the whole network is slightly altered, the basic correlation properties are conserved in the sampling.

³ We will refer to cores and degrees using the same index $k$, the distinction between the two cases being evident from the context. When the context is not clear, we will however specify the meaning of $k$ (whether it indicates the degree or the $k$-core index).

Figure 3.13: Plot of the size of the $k$-shells vs. $k$ for various models, before and after traceroute-like sampling, with different probing efforts $\epsilon$. We used an Erdős-Rényi (ER) random graph with $\langle k \rangle = 20$, a random scale-free network (RSF) with exponent $\gamma = 2.3$, and two networks obtained by the generators BRITE and INET, popularly used in the computer science community to reproduce some features of the Internet topology. All networks have size $N = 10^5$, except for INET ($N = 10^4$). Note that the $k$ index on the $x$-axis indicates the $k$-shell index, not the degree.

While on a qualitative level it seems possible to distinguish between networks with different topological structures, important quantitative biases appear, which are related to the arguments presented in Section 3.2.3 about the efficiency of node discovery. For a network obtained with BRITE, Fig. 3.15 displays the probability that the original shell index $k$ of a vertex has changed to $k'$ due to the sampling process. At low sampling effort, many vertices remain completely undiscovered, and in general shell index properties are strongly affected in a seemingly erratic way (see the plot for $\epsilon = 0.1$). For larger values of sampling effort a strong correlation appears between the two shell indices, even if a systematic downward trend is observed.
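The comparison between original and sampled shell indices can be organized as a table of conditional frequencies. The sketch below assumes that the two decompositions are available as dictionaries mapping each node to its shell index, and it records undiscovered nodes under the key `None`; these conventions and the function name are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def shell_transition_probabilities(original_shell, sampled_shell):
    """Estimate P(k' | k): fraction of nodes of original shell k found in shell k'.

    Nodes absent from sampled_shell (never discovered) are counted under None.
    """
    counts = defaultdict(Counter)
    for node, k in original_shell.items():
        k_prime = sampled_shell.get(node)  # None if the node was not discovered
        counts[k][k_prime] += 1
    return {
        k: {k_prime: c / sum(row.values()) for k_prime, c in row.items()}
        for k, row in counts.items()
    }


original = {"a": 2, "b": 2, "c": 2, "d": 1}
sampled = {"a": 2, "b": 1, "c": 1}  # node "d" was never discovered
probabilities = shell_transition_probabilities(original, sampled)
print(probabilities)
```

A systematic downward trend shows up in such a table as probability mass concentrated on values of $k'$ smaller than $k$.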

A naive explanation for such persistence of the k-core structure under traceroute-like sampling
can be found by looking at the process of path merging by which the maps are obtained. Consider three
traceroute paths between three different pairs of vertices: if these paths meet pairwise at some
nodes, the three meeting nodes are connected by a cycle, that is, they belong to a 2-core. Increasing
the number of paths, it is possible that strongly connected sets of nodes emerge (in the sense of a
large number of existing paths between them). This picture is likely to be true in scale-free networks,
in which the central nodes are redundantly sampled by traceroute (as seen in Section 3.2.4). Hence, the
presence of a genuine k-core structure in sampled graphs is due to the fact that nodes are sampled
by means of paths. On the contrary, since node-picking algorithms do not guarantee that nodes are
fairly connected, in many biological networks the k-core structure might be strongly biased by sampling
methods.
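The path-merging argument can be illustrated with a minimal sketch. The function names `merge_paths` and `shell_indices`, as well as the three toy paths, are illustrative assumptions and not part of the original analysis; the shell index is computed by the standard iterative pruning of low-degree nodes.

```python
from collections import defaultdict


def merge_paths(paths):
    """Build an undirected graph (adjacency sets) as the union of the given paths."""
    adj = defaultdict(set)
    for path in paths:
        for u, v in zip(path, path[1:]):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def shell_indices(adj):
    """Return the k-shell index of every node by iteratively pruning low-degree nodes."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # nodes whose residual degree is at most k belong to the k-shell
        stack = [v for v in remaining if degree[v] <= k]
        if not stack:
            k += 1
            continue
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            shell[v] = k
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] <= k:
                        stack.append(w)
    return shell


# Three hypothetical traceroute paths that meet pairwise at u, v and w.
paths = [
    ["s1", "u", "v", "t1"],
    ["s2", "v", "w", "t2"],
    ["s3", "w", "u", "t3"],
]
shell = shell_indices(merge_paths(paths))
```

In this toy case the three meeting nodes end up in the 2-shell, while the path endpoints stay in the 1-shell, which is the mechanism described above.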

In summary, the results presented here indicate that sampling biases affect only slightly the measured
statistical properties of the k-core organization in heterogeneous graphs, even at relatively low
levels of sampling. This corroborates the idea that the k-core properties observed for the Internet
are genuine. Quantitative analysis is more problematic, due to the incomplete sampling of the edges.
In fact, the routing properties of a network are related to the multiplicity of paths between nodes,
and thus to k-core properties. Hence, we conclude that the "measured" routing capacities of nodes, if
inferred only from Internet maps, are certainly somewhat underestimated compared to the real
performance.

Figure 3.14: Nearest neighbors degree distribution (left) and clustering spectrum (right) of some
k-cores, rescaled by the corresponding average values, for some network models after sampling through
a traceroute-like process with $N_S = 50$ sources and target density $N_T/N = 0.1$. Here, the index k
refers to the degree.

Figure 3.15: The grayscale code gives the probability of a change in shell index due to the
traceroute-like sampling, from a certain index before sampling (x axis) to another one after sampling
(y axis). The line at $y = 0$ represents the probability that vertices of shell index x are absent from
the sampled graph. The initial network is obtained by the BRITE generator. Here $N_S = 50$ sources and a
fraction $N_T/N = 2 \cdot 10^{-3}$ (left) and $N_T/N = 2 \cdot 10^{-2}$ (right) of targets are used.



3.3 Network Species Problem: a statistical method to correct 
biases 

An unexpected application of the traceroute model of Internet exploration is that it provides a
theoretical framework in which statistical estimators for important unknown properties of networks
can be defined and tested, and in which one can study which estimator is best for a given underlying
topology. In fact, in Ref. [242] we have shown that the inference from traceroute-like measurements
of many of the most basic topological quantities, including the network's size and degree
characteristics, is a version of the so-called 'species problem' in statistics. This observation has
important implications, since the species problem is known to be of a particularly challenging nature.

A basic example of a traceroute-based species problem is the estimate of the number of nodes in
a network (Section 3.3.1). Using statistical subsampling principles we have derived two estimators
for this quantity (Section 3.3.2), the performance of which will be illustrated by means of numerical
simulations on networks with various topological characteristics (Section 3.3.3).

According to the results presented in the previous section, one can conclude that at a qualitative
level traceroute-like samplings are reliable. On the other hand, at a quantitative level real networks
(e.g. the Internet at different levels) can differ considerably from sampled maps. The species approach
seems to be valuable for estimating quantities, such as the size, the number of links, the average
degree, and the precise analytic form of the heavy-tailed degree distribution, that cannot be estimated
using the techniques presented so far.



3.3.1 The Species Problem in Networks 

Let us call $\eta = \eta(G)$ a generic global quantity characterizing a graph G. In general, the real
value of $\eta$ is not known, thus it is natural to wish to produce an estimate, say $\hat{\eta}$, based
on the network sampling, i.e. on the traceroute-sampled graph $G^*$. However, for quantities like the
size N, the number of edges E, and the average degree $\langle k \rangle$, the problem of their inference
is closely related to the species problem in statistics. In general, the species problem refers to the
situation in which, having observed n members of a (finite or infinite) population, each of whom falls
into one of C distinct classes (or 'species'), an estimate $\hat{C}$ of C is desired. This problem
arises in numerous contexts, such as numismatics (e.g., how many of an ancient coin were minted [119]),
linguistics (e.g., what was the size of an author's apparent vocabulary [174]), and biology (e.g., how
many species of animals inhabit a given region).

The species problem has received a good deal of attention in statistics (see, for instance, Ref. [SU]),
but it is, in general, a difficult problem, since one needs to estimate the number of species not
observed. In practice, the species expected to be missed are those that are present in relatively low
proportions in the population, and there could be an arbitrarily large number of such species in
arbitrarily low proportions. The methods proposed for its solution differ in the assumptions regarding
the nature of the population, the type of sampling involved, and the statistical machinery used. We
have shown that it can also be associated with the inference of graph characteristics $\eta(G)$ in
traceroute-like samplings.

For example, the problem of estimating the number of vertices and edges in a network G, i.e., N and
E, may be mapped onto the species problem by considering each separate vertex i (or edge e) as a
'species' and declaring a 'member' of the species i (or e) to have been observed each time that i (or
e) is encountered on one of the $N_S \times N_T$ traceroute paths.

Similarly, the problem of inferring the degree $k_i$ of a vertex i from traceroute measurements can be
mapped onto the species problem, by letting all edges incident to i constitute a species and declaring
a member of that species to have been observed every time one of those edges is encountered. Because
the values N, E, and $\{k_i\}_{i \in V}$ serve as basic components of many of the other standard
quantities listed above, obtaining an accurate inference of the former could directly impact our
ability to make accurate inferences on the latter. In the following, we will consider only the
inference of the network size N, but we are now developing a similar formalism in order to extend our
analysis to the number of edges E.
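The mapping onto the species problem can be sketched in a few lines. This is only an illustration: the function name `species_counts` and the toy paths are hypothetical, and the observed counts are lower bounds on N, E and $k_i$, not estimators of them.

```python
from collections import Counter


def species_counts(paths):
    """Count how often each vertex 'species' and edge 'species' is observed on the paths."""
    vertex_hits = Counter()
    edge_hits = Counter()
    for path in paths:
        vertex_hits.update(path)
        # an undirected edge is stored as a frozenset of its two endpoints
        edge_hits.update(frozenset(e) for e in zip(path, path[1:]))
    # observed degree of i: number of distinct incident edges seen at least once
    observed_degree = Counter()
    for edge in edge_hits:
        for v in edge:
            observed_degree[v] += 1
    return vertex_hits, edge_hits, observed_degree


# Three hypothetical traceroute paths from a single source "s".
paths = [["s", "a", "b", "t1"], ["s", "a", "c", "t2"], ["s", "a", "b", "t3"]]
vertex_hits, edge_hits, observed_degree = species_counts(paths)
n_star = len(vertex_hits)  # number of vertex species observed (lower bound on N)
e_star = len(edge_hits)    # number of edge species observed (lower bound on E)
```

Vertices and edges close to the source are observed many times, while peripheral ones are seen once or not at all, which is the typical situation of a species problem.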

3.3.2 Inferring N: Estimators of Network Size

Before proceeding to the construction of estimators for N, it is useful to first better understand the
relation between this quantity and known characteristics of the Internet topology.
Since the main property affecting the exploration process is the betweenness centrality, one could
argue that N should be estimated using quantities derived from the betweenness. In particular, the
network's size is related to the betweenness centrality by the following simple expression [131]:

$$\sum_i b_i = N(N-1)\left(\langle \ell \rangle - 1\right), \tag{3.28}$$

in which $\langle \ell \rangle$ is the average distance between pairs of nodes. This may be rewritten in
the form

$$N = 1 + \frac{E[b]}{\langle \ell \rangle - 1}, \tag{3.29}$$

where the expectation $E[\cdot]$ is with respect to the distribution of betweenness across nodes in the
network.
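The relation between size, average betweenness and average distance can be checked numerically on a small graph. The sketch below assumes that the betweenness counts ordered source-target pairs and excludes the endpoints of each path, which is the convention under which the identity holds exactly; the function name and the toy graph are illustrative.

```python
from collections import deque


def betweenness_and_mean_distance(adj):
    """Brandes-style accumulation over ordered pairs; path endpoints are excluded."""
    nodes = list(adj)
    b = dict.fromkeys(nodes, 0.0)
    total_dist = 0
    for s in nodes:
        dist = {s: 0}
        sigma = dict.fromkeys(nodes, 0.0)  # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        total_dist += sum(dist.values())
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                b[w] += delta[w]
    n = len(nodes)
    return b, total_dist / (n * (n - 1))


# Toy connected graph: a 6-cycle with one pendant node attached to node 4.
adj = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5, 6], 5: [4, 0], 6: [4]}
b, mean_dist = betweenness_and_mean_distance(adj)
mean_b = sum(b.values()) / len(b)
n_from_betweenness = 1 + mean_b / (mean_dist - 1)
```

On the full graph the formula returns the exact size; the difficulty discussed in the text arises only when the average betweenness has to be inferred from a sampled map.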

In general, the average shortest path length $\langle \ell \rangle$ can be estimated quite accurately,
since traceroute probes lie on shortest paths and the corresponding distribution is very peaked.
Therefore, the problem of estimating N is essentially equivalent to that of estimating the average
betweenness centrality. From the theoretical analysis in previous sections, we know that traceroute
experiments give a good estimate of the tails of the degree and betweenness distributions, thus we can
assume that the form $P(b) \sim b^{-\beta}$ for $b \gg 1$ is sufficiently accurate, but low-betweenness
nodes are considerably undersampled, preventing us from having a correct quantitative knowledge of the
full distribution. Hence, the undersampling of low-betweenness nodes does affect the average value of
the betweenness. Additionally, even if we neglect this problem and divide the expectation into two
contributions for low- and high-betweenness nodes ($E[b] = E_1[b] + E_2[b]$), in order to compute the
average betweenness we should perform an integral of the type

$$E_2[b] \sim \int_{b_{\min}}^{b_{\max}} b \, b^{-\beta} \, db,$$

which has to be handled carefully since the experimental values of $\beta$ are very close to 2 [131],
thus the integral diverges with the upper cut-off.
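The sensitivity to the upper cut-off can be made explicit with a short worked evaluation. It assumes a pure power law $b^{-\beta}$ between a lower bound $b_{\min}$ and an upper cut-off $b_{\max}$; both symbols are notation introduced here for illustration.

```latex
\int_{b_{\min}}^{b_{\max}} b \, b^{-\beta} \, db =
\begin{cases}
\dfrac{b_{\max}^{\,2-\beta} - b_{\min}^{\,2-\beta}}{2-\beta}, & \beta \neq 2, \\[2ex]
\ln\!\left(\dfrac{b_{\max}}{b_{\min}}\right), & \beta = 2 .
\end{cases}
```

For $\beta \le 2$ the result grows without bound as $b_{\max}$ increases, and for $\beta$ slightly larger than 2 it converges only very slowly, so in both cases the value is dominated by the poorly known cut-off.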

These simple arguments give an idea of the difficulty of estimating N from traceroute measurements 
and suggest the futility of attempting a parametric approach with current measurement tools and 
information. There is still the alternative of a nonparametric approach, in which assumed parametric 
distributions are eschewed. We have proposed two estimators for N, using subsampling principles: one 
is based on the resampling of the network, the other is a refined estimator based on the "leave-one-out" 
principle [TT2l I33B| . 

Resampling Estimator - Let us call discovery ratio $\theta = E[N^*]/N$ the average fraction of
nodes discovered. From the mean-field theory we have learned that this quantity varies smoothly as
a function of the fraction $\rho_T = N_T/N$ of targets sampled, for a given number $N_S$ of sources. We
use this fact, paired with the assumption of a type of scaling relation on G, to construct an estimator
for N.

Specifically, let H be an arbitrary subgraph, of size N(H), of the network graph G. We will assume
that, for roughly similar numbers $N_S(H)$ and $N_S(G)$ of sources used, the discovery ratios for
traceroute sampling on H and G are such that

(i) they vary as smooth functions $\theta(H; \rho_T(H))$ and $\theta(G; \rho_T(G))$ of $\rho_T(H)$ and
$\rho_T(G)$, respectively;

(ii) if $\rho_T(H) = \rho_T(G)$, then $\theta(H; \rho_T(H)) = \theta(G; \rho_T(G))$.    (3.30)

In other words, we expect similar proportions of targets to yield similar proportions of discovered
nodes. Our choice to use $N_S(H) = N_S(G)$ stems from the fact that typical traceroute-driven
studies run from a relatively small number of sources to a much larger set of destinations. Rewriting
the expression $\theta(H; \rho_T(H)) = \theta(G; \rho_T(G))$ yields the equation

$$N(G) = N(H)\, \rho(G, H), \quad \text{where} \quad \rho(G, H) = \frac{E[N^*(G)]}{E[N^*(H)]}. \tag{3.31}$$

Now if H is a known subgraph, N(H) is known as well, and the problem of inferring N = N(G) can
be reduced to one of inferring $\rho(G, H)$. A natural candidate for such a subgraph is the choice
$H = G^*$, i.e., the graph produced by an initial traceroute sampling on G. In this case, $N(H) = N^*$.
It remains then to estimate $\rho(G, G^*)$, which must be defined conditional on $G^*$. In that case, the
expectation in the numerator is simply $E[N^*(G) \mid G^*] = N^*$. To estimate the other expectation,
$E[N^*(G^*) \mid G^*]$, we use a strategy based on the resampling of paths in $G^*$. In particular, for a
given sampling rate $\rho_T^* = \rho_T(G^*)$, we sample $N_T^* = \rho_T^* N^*$ targets on $G^*$ and create
a resampled graph, say $G^{**}$, from the corresponding traceroute paths. Let $N^{**} = N(G^{**})$. We
do this some number of times, say B, forming the collection $N_1^{**}, \ldots, N_B^{**}$. Then, we
estimate $E[N^*(G^*) \mid G^*]$ by the average

$$\bar{N}_B^{**} = \frac{1}{B} \sum_{r=1}^{B} N_r^{**}.$$

Plugging these quantities into the expression for N in Eq. 3.31 we get

$$\hat{N}_{RS} = N^* \, \frac{N^*}{\bar{N}_B^{**}} \tag{3.32}$$

as a resampling-based estimator for N.

Note, however, that its derivation is based upon the premise that $\rho_T^* = \rho_T$, and $\rho_T$ is
unknown (i.e., since N is unknown). This issue may be addressed by noting that the equations
$\rho_T(H) = \rho_T(G)$ and $\theta(H; \rho_T(H)) = \theta(G; \rho_T(G))$ together imply the equation
$N_T(G)/N_T(H) = E[N^*(G)]/E[N^*(H)]$. With respect to the calculation of $\hat{N}_{RS}$, this fact
suggests the strategy of iteratively adjusting $N_T^* = N_T(G^*)$ until the relation
$N_T/N_T^* \approx N^*/\bar{N}_B^{**}$ holds (Robbins-Monro algorithm [2"T3]). The value of
$\bar{N}_B^{**}$ for the appropriate $N_T^*$ is then substituted into Eq. 3.32 to produce
$\hat{N}_{RS}$. In practice, one may increase B as the algorithm approaches the condition
$N_T^*/N_T \approx \bar{N}_B^{**}/N^*$.
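A minimal sketch of the resampling estimator follows, under several simplifying assumptions that are not taken from the original study: traceroute is mimicked by a single BFS shortest path per source-target pair, the underlying graph is a toy ring with random chords, the same sources are reused in the resampling, and the adjustment of $N_T^*$ is a crude fixed-point update, not a proper Robbins-Monro scheme. All function names are hypothetical.

```python
import random
from collections import deque


def shortest_path(adj, source, target):
    """Return one shortest path from source to target via BFS (connected graph assumed)."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            break
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    path = [target]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def traceroute_sample(adj, sources, targets):
    """Merge the shortest paths from every source to every target into a sampled graph."""
    sampled = {}
    for s in sources:
        for t in targets:
            path = shortest_path(adj, s, t)
            for v in path:
                sampled.setdefault(v, set())
            for u, v in zip(path, path[1:]):
                sampled[u].add(v)
                sampled[v].add(u)
    return sampled


def resampling_estimate(sampled, sources, n_targets, n_resamples, rng, n_iter=4):
    """N_RS = N* * N* / mean(N**), with N_T* adjusted by a simple fixed-point update."""
    n_star = len(sampled)
    candidates = sorted(v for v in sampled if v not in sources)
    n_t_star = n_targets
    for _ in range(n_iter):
        sizes = []
        for _ in range(n_resamples):
            targets = rng.sample(candidates, n_t_star)
            sizes.append(len(traceroute_sample(sampled, sources, targets)))
        mean_n2 = sum(sizes) / n_resamples
        # aim at N_T* / N_T ~ mean(N**) / N*
        n_t_star = max(1, round(n_targets * mean_n2 / n_star))
    return n_star * n_star / mean_n2


rng = random.Random(0)
n = 400
adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}  # ring keeps the graph connected
for _ in range(400):
    a, c = rng.sample(range(n), 2)
    adj[a].add(c)
    adj[c].add(a)
sources = rng.sample(range(n), 3)
targets = rng.sample([v for v in range(n) if v not in sources], 40)
g_star = traceroute_sample(adj, sources, targets)
estimate = resampling_estimate(g_star, sources, len(targets), 10, rng)
```

Since every resampled graph is contained in $G^*$, the estimate can never fall below the observed size $N^*$; how close it gets to N depends on how well the scaling assumption holds for the graph at hand.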

Leave-One-Out Estimator - Various other subsampling paradigms might be used to construct an
estimator. A popular one is the 'leave-one-out' strategy, which amounts to subsampling $G^*$
with $N_T^* = N_T - 1$. We apply such a principle to the problem of estimating N, in a way that does
not require the subsampling assumptions in Eq. 3.31.

Recall that $V^*$ is the set of all vertices discovered by a traceroute study, including the $N_S$
sources $S = \{s_1, \ldots, s_{N_S}\}$ and the $N_T$ targets $T = \{t_1, \ldots, t_{N_T}\}$. Our approach
will be to connect N to the frequency with which individual targets $t_j$ are included in traces from
the sources in S to the other targets in $T \setminus \{t_j\}$. Accordingly, let $V^*_{i,j}$ be the set
of vertices discovered on the path from source $s_i$ to target $t_j$, inclusive of $s_i$ and $t_j$. Then
the set of vertices discovered as a result of targets other than a given $t_j$ can be represented as
$V^*_{(-j)} = \bigcup_i \bigcup_{j' \neq j} V^*_{i,j'}$. Next define
$\delta_j = I\{t_j \notin V^*_{(-j)}\}$ to be the indicator of the event that target $t_j$ is not
'discovered' by traces to any other target. The total number of such targets is $X = \sum_j \delta_j$.

We derive a relation between X and N. Assuming a random sampling model for the selection of source
and target nodes from V, we have

$$\Pr\left(\delta_j = 1 \mid V^*_{(-j)}\right) = \frac{N - N^*_{(-j)}}{N - N_S - N_T + 1}, \tag{3.33}$$

where $N^*_{(-j)} = |V^*_{(-j)}|$. Note that, by symmetry, the expectation $E[N^*_{(-j)}]$ is the same
for all j: we denote this quantity by $E[N^*_{(-)}]$. As a result, we may write

$$E[X] = \sum_j \frac{N - E[N^*_{(-j)}]}{N - N_S - N_T + 1} = N_T \, \frac{N - E[N^*_{(-)}]}{N - N_S - N_T + 1}, \tag{3.34}$$

which may be rewritten as

$$N = \frac{N_T \, E[N^*_{(-)}] - (N_S + N_T - 1)\, E[X]}{N_T - E[X]}. \tag{3.35}$$

To obtain an estimator for N from this expression it is necessary to estimate $E[N^*_{(-)}]$ and
$E[X]$, for which it is natural to use the unbiased estimators
$\bar{N}^*_{(-)} = (1/N_T) \sum_j N^*_{(-j)}$ and X itself, measured during the traceroute study.
However, while substitution of these quantities in the numerator of Eq. 3.35 is fine, substitution of
X for $E[X]$ in the denominator can be problematic in the event that $X = N_T$. Indeed, the estimator
of N diverges when none of the targets $t_j$ are discovered by traces to other targets, which is
possible if $\rho_T = N_T/N$ is small. A better strategy is to estimate the quantity $1/(N_T - X)$
directly. We assume that the overlap between different sampled sets of vertices is very high. Using
this condition, it is possible to derive an approximately unbiased estimator of $1/(N_T - X)$.
(Note that empirical data on the Internet collected by the Skitter project at CAIDA [22] show that
the discovery rate is rather uniform, validating our assumption.) The same overlapping argument
implies that $N^*_{(-j)} \approx N^*$ for all j, which suggests replacement of $\bar{N}^*_{(-)}$ by
$N^*$. Putting together all these quantities in Eq. 3.35, and with a bit of algebra (see Ref. [242] for
a detailed derivation), we get

$$\hat{N}_{L1O} \approx (N_S + N_T) + \frac{N^* - (N_S + N_T)}{1 - w^*}, \tag{3.36}$$

where $w^* = X/(N_T + 1)$, X being the number of targets not discovered by traces to any other target.
In other words, $\hat{N}_{L1O}$ can be seen as counting the $N_S + N_T$ vertices in $S \cup T$
separately, and then taking the remaining $N^* - (N_S + N_T)$ nodes that were 'discovered' by traces and
adjusting that number upward by a factor of $(1 - w^*)^{-1}$. This form is in fact analogous to that of
a classical method in the literature on species problems, due to Good [134], in which the observed
number of species is adjusted upwards by a similar factor that attempts to estimate the proportion of
the overall population for which no members of species were observed.
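The leave-one-out estimator only requires the recorded paths, so it can be sketched compactly. The function name `leave_one_out_estimate`, the dictionary layout keyed by (source, target), and the toy paths are assumptions made for illustration.

```python
def leave_one_out_estimate(paths, sources, targets):
    """N_L1O = (N_S + N_T) + (N* - (N_S + N_T)) / (1 - w*), with w* = X / (N_T + 1)."""
    discovered = set()
    for path in paths.values():
        discovered.update(path)
    n_star = len(discovered)

    # X: number of targets that do not appear on any trace towards another target
    x = 0
    for t in targets:
        seen_by_others = set()
        for (_, other_target), path in paths.items():
            if other_target != t:
                seen_by_others.update(path)
        if t not in seen_by_others:
            x += 1

    n_s, n_t = len(sources), len(targets)
    w_star = x / (n_t + 1)  # always below 1, so the correction factor stays finite
    return (n_s + n_t) + (n_star - (n_s + n_t)) / (1 - w_star)


# Hypothetical traces from one source "s" to three targets.
sources = ["s"]
targets = ["a", "b", "c"]
paths = {
    ("s", "a"): ["s", "x", "a"],
    ("s", "b"): ["s", "x", "a", "b"],
    ("s", "c"): ["s", "y", "c"],
}
n_l1o = leave_one_out_estimate(paths, sources, targets)
```

Here target a lies on the trace to b, while b and c are seen only on their own traces, so X = 2 and the two non-source, non-target nodes discovered are scaled up by a factor of 2.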



3.3.3 Numerical Results 

We have tested the performance of the estimators on three different types of networks: two
computer-generated networks, with homogeneous (ER model) and heterogeneous (BA model) degree
distributions, and a network based on measurements of the real Internet (Mercator mapping project
[135]). For the synthetic networks, we have considered average degree 6, and sizes ranging from $10^3$
to $10^6$ nodes. The Mercator network (N = 228263 nodes and E = 320149 edges) has been used to see
whether more realistic topologies give results in agreement with those from the models.

We plot in Fig. 3.16 the ratios of the estimators to the true size, $\hat{N}_{RS}/N$ and
$\hat{N}_{L1O}/N$, together with $N^*/N$, for the various investigated graphs and for $N_S = 1$, 10,
and 100 sources, as a function of the target density $\rho_T$. The improvement with respect to the
"trivial" estimation by the size $N^*$ of the sampled graph is impressive, the optimal value being 1
for all these curves. Increasing either the number of sources $N_S$ or the density of targets $\rho_T$
yields better results, even for $N^*$, but the estimators we have introduced converge much faster than
$N^*$ towards values close to the true size N. In fact, a relatively small number of sources and
targets is sufficient for these estimators to perform very accurately, in particular in the case of
the "leave-one-out" estimator. Note, however, that the "leave-one-out" estimator has a larger
variability at small values of $\rho_T$, while that of the resampling estimator is fairly constant
throughout. This is because in calculating $\hat{N}_{RS}$ the uncertainty scales in the same way both
in the sampling and in the resampling, whereas for $\hat{N}_{L1O}$ the uncertainty scales with
$\rho_T$, which changes in passing from sampling to resampling.

In terms of topology, the estimation of N appears to be easiest for the ER model. Even $N^*$ is
more accurate, i.e. the discovery rate is higher. The estimation for the Mercator graph seems to be
the hardest, possibly because it is the graph with the largest fraction of low degree (and thus low
betweenness) nodes among the three studied. It is worth noting that the "leave-one-out" estimation
seems to depend only on $N^*/N$ and $N_T$, thus being quite stable across the three graphs.
Interestingly, the estimators perform better for larger sizes, as shown by Fig. 3.17, in which we
investigate, at fixed $N_S$ and $\rho_T$, the effect of the real size N of the graph. On the contrary,
$N^*/N$ decreases. This is due to the fact that the sampled graph $G^*$ gets bigger, providing more and
richer information, even if the discovery ratio does not grow. The odd nature of the results for the BA
graph comes from the peak associated with the resampling estimator, see Fig. 3.16. At fixed number of
targets, however, $\hat{N}_{RS}/N$ and $\hat{N}_{L1O}/N$ decrease as N increases (not shown, see
Ref. [242]).

Figure 3.16: Comparison of the various estimators for the BA (top), ER (middle) and Mercator
(bottom) networks. The curves show the ratios of the various estimators to the true network size, as
a function of the target density $\rho_T$. Full circles: $\hat{N}_{L1O}/N$; empty squares:
$\hat{N}_{RS}/N$; stars: $N^*/N$. The error bars go from the 10% to the 90% percentiles. Left figures:
$N_S = 1$ source; middle: $N_S = 10$ sources; right: $N_S = 100$ sources.

The comparison of the two estimators shows that the resampling estimator, although yielding a clear
improvement with respect to $N^*$, systematically performs less well. This is probably due to the fact
that the basic hypothesis of scaling (Eq. 3.31) is only approximately satisfied, while for
$\hat{N}_{L1O}$ the underlying hypotheses are well satisfied. Nonetheless, the resampling estimator
needs less information on the graph than $\hat{N}_{L1O}$, since the "leave-one-out" estimator uses the
knowledge that some targets are not discovered by the paths to other targets.

Of course, in order to apply this inference model to the Internet one should take into account further
issues, such as the effects of non-random deployment of sources or the relation with more realistic
models of traceroute exploration. However, the "leave-one-out" estimator should not suffer much in
performance, since its derivation assumes only a uniform random choice of targets, not of sources, and
does not make any assumption on the routing strategy.

Finally, we provide a phenomenological validation of our technique. The Internet size can also be
estimated using other techniques, for instance with ping probes that test the response of a
sufficient number of randomly chosen IP addresses (the total number of possible IP addresses is
$2^{32}$). Computing the fraction $\alpha$ of 'alive' addresses, we get the estimator
$N_{ping} = 2^{32} \alpha$ for the size of the Internet.

Figure 3.17: Effect of the size N of the graph G for BA and ER graphs at constant number of sources
and density of targets. The curves show the ratios of the various estimators to the true network size,
as a function of the graph size N. Full circles: $\hat{N}_{L1O}/N$; empty squares: $\hat{N}_{RS}/N$;
stars: $N^*/N$. The error bars go from the 10% to the 90% percentiles. $N_S = 10$. Left figures:
$\rho_T = 10^{-3}$; middle: $\rho_T = 10^{-2}$; right: $\rho_T = 10^{-1}$.

In order to check whether the "leave-one-out" estimator gives results in agreement with the ping
estimator, we sent about $3.7 \cdot 10^6$ ping probes, obtaining 61246 valid responses, and, in
parallel, performed a traceroute study from the same source towards the same number of unique IP
addresses. The estimated size was $N_{ping} = 7.06 \cdot 10^7$ using the ping estimator, and
$\hat{N}_{L1O} = 7.23 \cdot 10^7$ using the "leave-one-out" estimator, confirming a good agreement
between the two. Note that the numbers are not intended to be taken too seriously (we used only one
source, toward a small number of targets), but it is important that they give consistent results.
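The ping-based estimate is a one-line computation, sketched below with the figures quoted in the text. The function name is hypothetical, and the probe count is the rounded value "about 3.7 million", so the result should only be expected to match the quoted estimate to within a few percent.

```python
def ping_size_estimate(n_alive, n_probed, address_bits=32):
    """N_ping = 2^address_bits times the fraction of probed addresses that respond."""
    alive_fraction = n_alive / n_probed
    return (2 ** address_bits) * alive_fraction


# 61246 valid responses out of roughly 3.7e6 probes (rounded figure from the text).
n_ping = ping_size_estimate(61246, 3.7e6)
```

With the rounded probe count the result is of order $7 \cdot 10^7$, consistent with the value reported above; the residual difference presumably comes from the rounding of the number of probes.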
3.4 Conclusions

In contrast with the usual physical disciplines, the science of complex networks is characterized by
the absence of first principles from which a theoretical framework can be deduced; for this reason, a
pressing and timely problem is that of testing the reliability of phenomenological data, on which the
whole theoretical analysis and modeling of networks is based. The first step in this direction is that
of assessing the validity of the peculiar topological properties observed in so many real networks,
from biological to social and technological ones. This means that we need an accurate description of
the sampling processes by which data are collected.

In this chapter, we have highlighted the main features of the different sampling mechanisms that are
involved in the experimental analysis of networks. In some cases, such as for biological and social
networks, sampling methods may suffer from strong biases that are often unpredictable, because of
the incomplete knowledge of all (hidden) processes involved in the experiments. In the case of the
Internet, instead, there exists a well-known and very useful practical method to extract information
from the network, which consists in the dynamical exploration by means of traceroute-like probes.
Unfortunately, this method also suffers from biases that could in principle cause a misrepresentation
of the topological properties of the network, in particular of the degree distribution.

By means of a mean-field statistical approach, we have provided an insight into the relation between
sampling biases and topological properties of the network, showing that the efficiency and accuracy
of the exploration process depend mainly on the betweenness centrality distribution of the underlying
network. The other important parameters are the densities of sources and targets deployed in the
system, or more simply, the probing effort, i.e. the number of probes used in the exploration. In
particular, we have explained how the observed distortions of the degree distribution are actually
possible only when the underlying network is homogeneous and the traceroute process is performed
using very few sources. Moreover, in order to observe power-laws starting from homogeneous networks,
the average degree of the original network must be of the same order as the power-law cut-off, a rather
unlikely situation for any real technological network (in particular for the Internet at the AS level).
Our results provide strong evidence that in order to observe a power-law distribution, we need a broad
distribution at the origin, but show as well that networks with other heterogeneous distributions,
different from power-laws, can be seen as scale-free networks under sampling. With a low number of
sources and targets, such as those involved in outdated Internet mapping projects, we cannot completely
exclude the possibility that the degree distribution of the real network has the shape of a Weibull
distribution rather than of a stretched exponential. According to our analysis, in the case of
heterogeneous networks with a non power-law degree distribution, an increase of both source and target
densities should make departures from a power law visible.

In agreement with our conclusion on the necessity of increasing the number of sources used in
Internet mapping projects in order to get more accurate data, a distributed mapping project called
DIMES (Distributed Internet MEasurements & Simulations) |981 E21 has recently been developed, with the
aim of studying the structure and topology of the Internet using a very large number of sources. The
idea is to involve common Internet users in the measurements, creating a volunteer community that sends
traceroute-like probes throughout the Net. Preliminary results from the DIMES project confirm the
power-law shape of the degree distribution. The fact that DIMES is observing power-laws seems to be
strong evidence of the genuine nature of scale-free topologies in the Internet. Nonetheless, it is
still very likely that the real exponent is different from (i.e. larger than) the observed one.

In conclusion, the main contributions of this thesis to the field can be summarized in the following
points:

• We have identified, at a mean-field level, the relations between observed properties of the sampled
networks and the topological properties of the underlying network.

• The origin of sampling biases in Internet measurements has been uncovered. The topological
properties of the underlying network, together with a minimal set of external parameters (the
densities of sources and targets), govern the efficiency and accuracy of the sampling process.

• By tuning the external parameters we have provided some results that should serve as guidance
for more realistic optimization strategies.

• The effects of traceroute explorations on local and non-local structures have been analyzed.

• The traceroute-like explorations can be translated into the framework of statistical species
problems, leading to the formulation of non-parametric statistical approaches for the inference of
the Internet's global properties. As an example of the validity of this method to correct the biases
introduced by the sampling process, we have introduced unbiased estimators of the number of
nodes in the network, which can be applied to the case of the Internet, the real size of which is
still unknown.
Chapter 4



Spreading and Vulnerability in 
Complex Networks 

4.1 Introduction 

A second topic investigated in this thesis concerns the role played by some topological and dynamical
properties in determining the functionality of weighted networks. In particular, we focus on the
analysis of vulnerability and spreading properties. The motivations for this type of analysis, coming
from the observation of the functional properties of real infrastructure networks, are presented in
Section 4.1.1, while in Section 4.1.2 we introduce the main idea of the chapter, which consists in the
possibility of linking two apparently different subjects, vulnerability and spreading, by means of
percolation-like approaches. Then, in Section 4.2 we focus on the vulnerability of weighted networks,
taking as a case study the airports network. In Section 4.3 we propose a theoretical framework for the
study of spreading phenomena in weighted networks, using the methods of percolation theory.

4.1.1 Motivations 

From a practical point of view, an accurate knowledge of the structural and dynamical properties of
networks is of primary interest, particularly in the case of technological networks, which form the
backbone of world-wide communication and transportation infrastructures. On the other hand, such
networks are intrinsically weighted networks, so that neglecting weights could lead to wrong
conclusions about these properties. As mentioned in Chapter ^, weights are usually related to dynamics:
in some cases they simply measure the traffic on the edges, in other cases they quantify the different
capacity of the edges to exchange physical quantities (energy, information, goods, etc.).

From the point of view of a single dynamical process evolving on a network (e.g. a spreading
process), the introduction of weights on the edges corresponds, pictorially, to adding an energy-like
variable to the system and viewing the process as if it were evolving in an energy landscape. In this
scenario, some paths with larger weights (or smaller, depending on the definition of the weights) will
be dynamically more convenient than others, and thus they will be preferentially chosen in the
spreading. In addition, just as the trajectory of a particle in a potential is deformed when we change
the shape of the potential, fluxes on weighted networks adjust themselves in order to follow variations
of both topology and weights. In other words, the presence of weights on the edges tunes the
functionality of the system towards an optimal point. This picture suggests viewing technological
networks as critical infrastructures, whose topological, dynamical and functional properties are
intimately related.

The motivations of our research in the field of weighted networks can be summarized in the following
questions:

• What are the levels of structural and functional stability of such critical structures if we perturb
them in different ways?

• What are the conditions for a macroscopic spreading? What is the relation with the functional
properties of the network?

Both questions are concerned with the prediction of extreme events, such as a global collapse
triggered by the spreading of a virus, a cascade failure or malicious attacks, and with the necessity
to protect real networks, ensuring their functioning. Nevertheless, the two topics are developed
separately.

We will first study the vulnerability properties of weighted graphs, showing that structural robustness
does not coincide with functional robustness if traffic is taken into account. The analysis is carried
out by means of a relevant case study, where practical implications are visible: the airports network.
The weighted representation is, furthermore, general enough to include other contributions, not
strictly due to the traffic. For instance, infrastructure networks are embedded in real space, with
Euclidean distances between nodes, and longer connections have higher costs. It is thus possible to
verify the role played by all these quantities in determining the functional integrity of the system.

We will then turn to the conditions for observing macroscopic spreading phenomena in the context of
weighted networks, keeping in mind the previous results about the functional properties of
infrastructure networks. We provide a general theory for spreading phenomena in weighted networks.
Some analytical results are presented in the case of (correlated) generalized random graphs, for which
we derive a very general criterion for the percolation threshold, which governs the spreading
properties of many dynamical phenomena (see Appendix B).

In order to have a unified view of these two topics, we can consider the idea, mentioned as an
introductory argument in Chapter Q, that the dynamics involves different timescales:
a slower timescale governs the evolution of the average traffic (i.e. the weights), and a faster one
characterizes single spreading processes (and thus also the perturbations to the average quantities).
When a cascade failure, a congestion phenomenon, or a disease appears and perturbs the normal
functioning of the network, the underlying weighted structure acts as a fixed landscape,
whose (weighted) properties can decisively influence the global effect of such a perturbation (e.g. the
occurrence of a traffic collapse, a pandemic state, etc). The advantage of using quenched weighted
networks is that both dynamical phenomena and vulnerability can be described easily
using percolation-like approaches.

4.1.2 Relation between percolation, vulnerability and spreading 

Percolation theory has been widely used in statistical mechanics to describe topological phase
transitions in lattice systems, such as the emergence of global properties when some physical
parameter (temperature, concentration, etc) exceeds a critical value |2281 11181 1591 IT5H| . More
precisely, a site percolation process can be sketched as follows. Each site of the lattice is occupied with
a uniform probability $q$, called the occupation probability, and empty with probability $1 - q$. A cluster
is a connected set of occupied sites. When the value $q$ of the occupation probability is larger than a
critical value $q_c$, the percolation threshold, a cluster with size of the same order as the system, i.e.
infinite in the thermodynamic limit, appears. Analogously, the process can be defined on the bonds joining
neighboring sites, with the only difference that the site occupation probability $q$ is replaced by a bond
occupation probability. Bond percolation corresponds to site percolation on the conjugate
lattice (obtained by exchanging bonds with sites and vice versa), and gives similar results.

Figure 4.1: Illustration of the relation between percolation, vulnerability and spreading processes. As
a result of percolation (left), the nodes of the original graph are divided into occupied (blue) and
unoccupied (red) nodes. On the right: the correspondence with node removal (top) and spreading
(bottom). On average, the subgraph induced by occupied nodes (black nodes in the top figure)
coincides with that covered by spreading (blue nodes in the bottom figure).

Recently, percolation theory has been successfully introduced in the field of complex networks, with
some seminal works on undirected uncorrelated random graphs 65 , 1801 179j . A number of other works
followed, studying the cases of correlated |189U190ll23o"llrj3] and directed |108l 12201 EtB] random graphs.
Percolation is intrinsically related to vulnerability. The random removal of a fraction $g$ of
nodes can be seen as a percolation process in which the occupation probability of the nodes is reduced
to $1 - g$. Indeed, we can exactly map a random removal of nodes onto a site percolation problem, with
$g = 1 - q$ and the same order parameter (i.e. the relative size of the giant component, or topological
integrity). Percolation theory predicts the presence of a threshold value $g_c = 1 - q_c$ above which the
giant component $\mathcal{G}_c$ disappears, i.e. the network is fragmented into many disconnected
components of very small size ($N_g \sim O(\log N)$) with respect to the total number $N$ of vertices in
the graph.
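
The mapping between random node removal and site percolation can be illustrated with a minimal Python
sketch. It is not part of the original analysis: the toy random graph, the function names and the
parameter values are assumptions made only for illustration, and each node is removed independently with
probability $g$ (so that a fraction $g$ is removed on average, i.e. $q = 1 - g$).

```python
import random
from collections import deque


def largest_component_size(adj, occupied):
    """Size of the largest connected cluster made of occupied nodes (BFS)."""
    seen = set()
    best = 0
    for start in occupied:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in occupied and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def random_graph(n, mean_degree, rng):
    """Toy homogeneous random graph (illustrative stand-in for a real network)."""
    p = mean_degree / (n - 1)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def topological_integrity(adj, g, rng):
    """Remove each node with probability g (occupation q = 1 - g); return N_g / N."""
    occupied = {u for u in adj if rng.random() >= g}
    return largest_component_size(adj, occupied) / len(adj)


rng = random.Random(0)
adj = random_graph(500, 4.0, rng)
for g in (0.0, 0.3, 0.6, 0.9):
    # The relative size of the largest cluster drops as g approaches g_c = 1 - q_c.
    print(g, topological_integrity(adj, g, rng))
```

On a homogeneous graph such as this one the largest cluster collapses at a finite value of $g$; the
heavy-tailed case discussed later in the chapter would require a different graph generator.
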
The notion of percolation has since found applications in the context of spreading phenomena on
networks, such as epidemic models [1821 [183 188 : for some of them, the connection with percolation
is long established (see for example [1241 11381 12441 I217| ), while other models require more
rigorous descriptions, based on the analysis of the population equations of epidemiology
[T751 12UT1 EU El CHI ES3 EEl HH E3 EH . The models of epidemic spreading solved using percolation
theory identify vertices with individuals and edges with pairwise contacts between them |188) . In
this context, the existence of a giant component means that we can reach a large (infinite) number of
other nodes of the same graph starting from one of them and moving along the edges. The spreading
process takes place on the edges, each of which has a fixed probability of being a disease-causing
contact, called the transmissibility. The model's behavior is easily determined using edge percolation,
which establishes the presence of a giant component as a function of the transmissibility, and hence the
possibility of global epidemic outbreaks. Figure 4.1 illustrates the relation between percolation,
vulnerability and spreading processes. In Ref. [EI], we have extended the analogy between spreading
processes and percolation to a very general situation in which node and edge inhomogeneity is considered.
Part of the analysis is reported in Appendices A and B. As a special case of such inhomogeneous
percolation, in Section 4.3 we present some results that are useful to study spreading processes on
weighted networks.
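
The correspondence between spreading and edge percolation can be sketched as follows. This is only an
illustrative toy, not the model analyzed in the thesis: the function name `outbreak_size`, the ring graph
and the values of the transmissibility `T` are assumptions. Each contact is tested at most once and
transmits with probability `T`, so the set of reached nodes is the bond-percolation cluster of the seed.

```python
import random
from collections import deque


def outbreak_size(adj, seed, T, rng):
    """Number of nodes reached from `seed` when each contact transmits with probability T."""
    infected = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            # An edge towards a not-yet-infected node is "occupied" with probability T.
            if v not in infected and rng.random() < T:
                infected.add(v)
                queue.append(v)
    return len(infected)


# Hypothetical contact network: a ring of 200 individuals with a few long-range contacts.
n = 200
adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
rng = random.Random(0)
for _ in range(100):
    a, b = rng.randrange(n), rng.randrange(n)
    if a != b:
        adj[a].add(b)
        adj[b].add(a)

for T in (0.1, 0.5, 0.9):
    runs = [outbreak_size(adj, rng.randrange(n), T, rng) for _ in range(200)]
    print(T, sum(runs) / len(runs))  # mean outbreak size grows with the transmissibility
```
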

4.2 Vulnerability of Weighted Networks: a case study

Recent studies of empirical data on real weighted networks (communication and infrastructure
networks, scientific collaboration networks, metabolic networks, etc) have paid much attention to the
relation between weighted properties and topological quantities |23llSl lT7I] . The most striking result
is the existence of non-trivial correlations between weights and topology, corroborating the idea
that topological and weighted properties influence each other, and that both take part in the evolution
of real weighted networks.

Several groups have proposed models of evolving weighted networks j^J 1261 1271 1391 11071 1243| ; we will
consider only one of them, the spatial BBV model, which has been described in Section 2.4.3.
In the present section, we address the issue of defining and determining the vulnerability of weighted
networks. It is first of all a problem of definition, since in many real networks the topological
connectivity is accompanied by weighted properties, which are usually related to the functionality of the
network.

From a topological point of view, the only relevant measure of the vulnerability of a network is the
size $N_g$ of the giant component after the removal of a fraction $g$ of the nodes. Comparing this size
with the original size $N_0$ of the network's giant component gives an estimate of the vulnerability
of the network under node (or edge) removal.

Using the already mentioned analogy with percolation, it has been shown that, in contrast with
regular lattices and homogeneous random graphs, heavy-tailed networks can tolerate very high levels
of random failure without falling into pieces |8U1 165) . On the other hand, malicious attacks on the
hubs can easily break the entire network into small components, providing a clear identification of the
elements that need the highest level of protection against such attacks jSJ 1201] .
In order to extend the vulnerability analysis to weighted complex networks, we have investigated how
the introduction of traffic and spatial properties may alter or confirm the above findings. In particular,
in Ref. |89| we address three main questions:

(i) on the basis of which measures a weighted network is most efficiently protected from damage; 

(ii) how traffic and spatial constraints influence the system's robustness; 

(iii) how to measure the damage (different integrity measures). 

The first step in addressing these issues consists in proposing a series of topological and
weight-dependent centrality measures that can be used to identify the most important vertices of a
weighted network. The functional integrity of the whole network depends on the protection of these central
nodes, as we will show by applying these considerations to the World-wide Air-transportation Network.
The main result is that under malicious attacks, weighted networks turn out to be more vulnerable
than expected, in that the traffic integrity is destroyed when the topological integrity of the network is
still very high.

4.2.1 Weighted centrality measures 

We introduce a series of weighted centrality measures that will be used to define attack strategies
of node removal on the network. Since we consider the case of technological networks, such as the
airport network, we focus on weighted centrality measures based on traffic and spatial properties. The
presence of non-trivial correlations between the different centrality measures reveals that all of them
should be considered in order to give an exhaustive description of the structural and
functional properties of a weighted network.



Measures definition - The centrality of nodes can be quantified by various measures, the most
intuitive of which is certainly the degree. The natural generalization of the degree to a weighted graph
is given by the strength $s_i$ of vertices (see Section 2.2.6 for definitions), which in the case of the
air-transportation network quantifies the traffic of passengers handled by airport $i$. The corresponding
geographical notion is the distance strength $D_i$, which we define as

$$D_i = \sum_{j \in \mathcal{V}(i)} d_{ij} , \qquad (4.1)$$

where $d_{ij}$ is the Euclidean distance between $i$ and $j$ and the sum runs over the set
$\mathcal{V}(i)$ of neighbors of $i$; $D_i$ quantifies how long the connections supported by an airport
are. Another interesting definition is obtained by combining the two ingredients; the outreach

$$O_i = \sum_{j \in \mathcal{V}(i)} w_{ij} d_{ij} , \qquad (4.2)$$

measures the total distance traveled by passengers from airport $i$ and is possibly related to the
total cost of connections.
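
As an illustration of these local measures, the following Python sketch computes degree, strength,
distance strength and outreach from an edge list. The three-airport network, its weights and its
distances are invented for the example, and the function name `local_centralities` is hypothetical.

```python
from collections import defaultdict


def local_centralities(edges):
    """edges: iterable of (i, j, w_ij, d_ij) for an undirected weighted, spatial graph.

    Returns four dicts: degree k_i, strength s_i, distance strength D_i (Eq. 4.1)
    and outreach O_i (Eq. 4.2).
    """
    k = defaultdict(int)
    s = defaultdict(float)
    D = defaultdict(float)
    O = defaultdict(float)
    for i, j, w, d in edges:
        for node in (i, j):
            k[node] += 1       # one more neighbor
            s[node] += w       # traffic handled on this link
            D[node] += d       # length of this link
            O[node] += w * d   # distance traveled by the traffic on this link
    return dict(k), dict(s), dict(D), dict(O)


# Invented toy data: (airport, airport, passengers w_ij, distance d_ij in km).
edges = [
    ("A", "B", 100.0, 500.0),
    ("A", "C", 10.0, 8000.0),
    ("B", "C", 50.0, 1000.0),
]
k, s, D, O = local_centralities(edges)
for node in sorted(k):
    print(node, k[node], s[node], D[node], O[node])
```

In this toy example all three nodes have the same degree, while their strength, distance strength and
outreach differ, which is the kind of deviation between rankings discussed in this section.
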

In general, non-local effects are taken into account by introducing the betweenness centrality (see
Section 2.2.4), which identifies crucial nodes that may have small degree or strength but act as bridges
between different parts of the network. In weighted networks, it seems natural to generalize the
notion of betweenness centrality through a weighted betweenness centrality in which shortest paths are
replaced with their weighted versions. As already anticipated in Section 2.2.6, a straightforward way
to generalize the hop distance in a weighted graph consists in assigning to each edge a length $\ell_{ij}$
that is a function of the characteristics of the link. In our case, both the weight $w_{ij}$ and the
Euclidean distance $d_{ij}$ between airports $i$ and $j$ are reasonable quantities. A quite natural
assumption is that the effective distance between two linked nodes should be a decreasing function of the
weight of the link. Indeed, larger flows (traffic) correspond to more frequent and faster exchange of
physical quantities (e.g. information, people, goods, energy, etc). Moreover, when spatial embedding is
considered, the edge length must be directly proportional to the geographical distance. In other words,
we define the length $\ell_{ij}$ of an edge as

$$\ell_{ij} = \frac{d_{ij}}{w_{ij}} . \qquad (4.3)$$

The definition of weighted betweenness centrality follows straightforwardly.

According to this definition, the centrality of a node provides a trade-off between the idea of the
topological "bridge" connecting different parts of a network, and the requirement of carrying a sufficient
level of traffic.
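
To make the role of the effective length concrete, the sketch below finds the weighted shortest path
between two nodes using Dijkstra's algorithm with $\ell_{ij} = d_{ij}/w_{ij}$. It is a minimal
illustration under stated assumptions: the toy network is invented, the function name is hypothetical, and
only a single path is computed, whereas the weighted betweenness counts such paths over all pairs of nodes.

```python
import heapq


def weighted_shortest_path(edges, source, target):
    """Dijkstra's algorithm with edge length l_ij = d_ij / w_ij (Eq. 4.3).

    edges: iterable of (i, j, w_ij, d_ij) for an undirected graph.
    Returns (path as a list of nodes, total effective length).
    """
    adj = {}
    for i, j, w, d in edges:
        length = d / w
        adj.setdefault(i, []).append((j, length))
        adj.setdefault(j, []).append((i, length))

    dist = {source: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, length in adj.get(u, []):
            nd = du + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Invented example: the direct link A-C carries little traffic, the route via B a lot.
edges = [
    ("A", "C", 1.0, 1000.0),   # l = 1000
    ("A", "B", 100.0, 600.0),  # l = 6
    ("B", "C", 100.0, 600.0),  # l = 6
]
path, length = weighted_shortest_path(edges, "A", "C")
print(path, length)
```

Here the hop-count shortest path is the direct link, while the weighted shortest path passes through the
high-traffic node B, which therefore gains weighted betweenness at the expense of the low-traffic link.
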



Centrality Measures Correlation - On the World-wide Air-transportation Network, all the
proposed measures of centrality are broadly distributed, and they are on average non-trivially
correlated with one another, and in particular with the degree (e.g. $D(k) \sim k^{\beta_D}$ with
$\beta_D \approx 1.5$ [2Z|, and $O(k) \sim k^{\beta_O}$ with $\beta_O \approx 1.8$ [ESI).

Vertices with large degree typically also have large strength and betweenness, but a detailed
analysis shows that deviations are possible. This fact has already been noted in Refs. |14fl I27| ,
where it is stressed that the most connected airports do not necessarily have the largest betweenness
centrality.

Here, we observe a similar effect in the relation between topological and weighted measures. For
instance, the scatter plot of the weighted betweenness vs. topological betweenness (not shown) shows
that departures from a perfect correlation are not that rare. Let us consider the list of the top 15
airports according to different measures of centrality (Table 4.1): strikingly, each definition provides a
different ranking. In addition, some airports that are very central according to a given definition become
peripheral according to another criterion. For example, Anchorage has a large betweenness centrality
but ranks only 138th and 147th in terms of degree and strength, respectively. Similarly, Phoenix or
Detroit have large strength but low ranks (> 40) in terms of degree and betweenness.

Rank  Degree        Betweenness    Strength          Outreach        W. Betweenness
 1    Frankfurt     Frankfurt      Atlanta           London-LHR      London-LHR
 2    Paris-CDG     Paris-CDG      Chicago-ORD       L.Angeles-LAX   L.Angeles-LAX
 3    Munich        Anchorage      London-LHR        Tokyo-HND       Singapore
 4    Amsterdam     Tokyo-HND      Tokyo-HND         Frankfurt       New-York-JFK
 5    Atlanta       Port Moresby   L.Angeles-LAX     Paris-CDG       Miami-MIA
 6    London-LHR    London-LHR     Dallas-DFW        Chicago-ORD     Tokyo-HND
 7    Chicago-ORD   L.Angeles-LAX  Paris-CDG         New-York-JFK    Hong-Kong
 8    Duesseldorf   Singapore      Frankfurt         Singapore       Sydney
 9    Vienna        Vancouver      Phoenix           San Francisco   Chicago-ORD
10    Brussels      Bangkok        Denver            Atlanta         Paris-CDG
11    Dallas-DFW    Johannesburg   Hong Kong         Hong-Kong       Seattle
12    Houston-IAH   Toronto-YYZ    Detroit-DTW       Bangkok         Atlanta
13    Rome-FCO      Amsterdam      Minneapolis-MSP   Amsterdam       Dubai
14    Minneapolis   Seoul-ICN      Madrid            Dallas-DFW      Frankfurt
15    Zurich        Hong Kong      Houston           New-York-EWR    Brisbane

Table 4.1: Ranking of the world's top fifteen airports for different centrality measures: degree,
topological betweenness, strength, outreach and weighted betweenness. Note that the lists report
individual airports, not aggregated data for cities; for cities with more than one airport, the
code of the corresponding airport is indicated.

A quantitative analysis of the correlations between two rankings of $n$ objects can be done using rank
correlations such as Kendall's $\tau$ |154l 1207)

$$\tau = \frac{n_c - n_d}{n(n-1)/2} , \qquad (4.4)$$

where $n_c$ is the number of pairs whose order does not change in the two different lists and $n_d$ is the
number of pairs whose order is inverted. This quantity is normalized between $-1$ and $1$: $\tau = 1$
corresponds to identical rankings, $\tau = 0$ is the average for two uncorrelated rankings, and
$\tau = -1$ is a perfect anticorrelation. Table 4.2 gives the values of $\tau$ for all possible pairs of
centrality rankings. For $N = 3880$, two random rankings yield a typical value of $\pm 10^{-2}$, so that
even the smallest observed value $\tau = 0.21$ is the sign of a strong correlation (all the values in this
table were already attained for a sublist of only the first $n$ most central nodes, with
$n \approx 500$). Remarkably enough, there is a strong correlation even between local and non-local
quantities (degree vs. betweenness), but the weighted quantities seem to have a lower level of
correlation with the topological ones.
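
A direct, quadratic-time evaluation of Eq. (4.4) can be sketched in Python as follows. The function name
and the toy rankings are assumptions made for the example, and ties are assumed to be absent.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties assumed).

    rank_a, rank_b: dicts mapping each item to its position in the two lists.
    """
    items = list(rank_a)
    n = len(items)
    n_c = 0  # pairs ordered the same way in both lists
    n_d = 0  # pairs whose order is inverted
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            n_c += 1
        elif sign < 0:
            n_d += 1
    return (n_c - n_d) / (n * (n - 1) / 2)


# Hypothetical rankings of four nodes according to two centrality measures.
by_degree = {"A": 1, "B": 2, "C": 3, "D": 4}
by_strength = {"A": 2, "B": 1, "C": 3, "D": 4}  # A and B swapped
print(kendall_tau(by_degree, by_strength))
```

For rankings of thousands of nodes a faster algorithm would be preferable; the explicit pair loop is
kept here only because it mirrors the definition.
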



Figure 4.2: Geographical distribution of the world's 15 most central airports ranked according to
different centrality measures (degree, strength, outreach, BC and WBC). The regions distinguished are
Africa+ME, Asia+Oceania, Europe and N. America.

Another important aspect concerns the geographical relation between centrality measures. Indeed,
they are non-homogeneously distributed across geographical regions, providing a further valuable
hint for understanding the functioning of infrastructure networks such as the WAN. Figure 4.2 displays
the geographical distribution of the world's fifteen most central airports ranked according to different
centrality measures. On the one hand, it is clear that topological measures miss the economic
dimension of the WAN, while weighted measures reflect traffic and economic realities.

European airports are very well connected but the core of the traffic is North America, where we
find the peaks of strength and outreach. Betweenness-based measures, on the other hand, pinpoint the
most important nodes in each geographical zone. In particular, the weighted betweenness appears as
a balanced measure that combines traffic importance with topological centrality, leading to a more
uniform geographical distribution of the most important nodes.



4.2.2 Vulnerability of the WAN 

The robustness of complex networks has been extensively investigated in the case of unweighted
networks |51 l%0*ll67flll44| , focusing on the topological integrity of the network $N_g/N_0$. When,
increasing $g$, we reach $N_g \sim O(1)$, the entire network has been destroyed. We say that the attack
strategy is "driven" by a property of the network if the order in which the nodes are removed one by one
corresponds to the ranking list according to that property. On heterogeneous networks, a degree-driven
damage strategy (i.e. nodes are removed in order of degree, starting from the maximum one) is
extremely effective, leading to the total fragmentation of the network at very low values of $g$
|8()U65l l5] , but the removal of the nodes with largest betweenness typically leads to an even faster
destruction of the network |144) .

                       k      D      s      O      BC     WBC
Degree k               1      0.7    0.58   0.584  0.63   0.39
Distance strength D    0.7    1      0.56   0.68   0.48   0.23
Strength s             0.58   0.56   1      0.83   0.404  0.24
Outreach O             0.584  0.68   0.83   1      0.404  0.21
Betweenness BC         0.63   0.48   0.404  0.404  1      0.566
Weighted BC (WBC)      0.39   0.23   0.24   0.21   0.566  1

Table 4.2: Similarity between the various rankings as measured by Kendall's $\tau$. For random rankings
of $N$ values, the typical $\tau$ is of order $10^{-2}$.

Figure 4.3: Effect of different attack strategies on the size of the connected giant component (top) and
the outreach integrity (bottom). The attack strategies consist in removing nodes in order of: degree
(black circles), strength (red squares), outreach (green diamonds), distance strength (blue triangles),
betweenness (magenta crosses) and weighted betweenness (orange stars).

In the case of weighted networks, we consider additional quantities related to the functionality of the
network, such as the largest traffic or strength still carried by a connected component of the network.
In Ref. (EH|, we have defined three new measures of network damage

$$I_s(g) = \frac{\mathcal{S}_g}{\mathcal{S}_0} , \quad I_O(g) = \frac{\mathcal{O}_g}{\mathcal{O}_0} , \quad I_D(g) = \frac{\mathcal{D}_g}{\mathcal{D}_0} , \qquad (4.5)$$

where $\mathcal{S}_0 = \sum_i s_i$, $\mathcal{O}_0 = \sum_i O_i$ and $\mathcal{D}_0 = \sum_i D_i$ are the
total strength, outreach and distance strength in the undamaged network, and
$\mathcal{S}_g = \max_H \sum_{i \in H} s_i$, $\mathcal{O}_g = \max_H \sum_{i \in H} O_i$ and
$\mathcal{D}_g = \max_H \sum_{i \in H} D_i$ correspond to the largest strength, outreach or distance
strength carried by any connected component $H$ of the network after the removal of a density $g$ of
nodes. These quantities measure the integrity of the network with respect to either strength, outreach or
distance strength, since they refer to the relative traffic or flow that is still handled in the largest
operating component of the network.
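
The strength integrity $I_s(g)$ can be sketched in a few lines of Python. The example rests on an
assumption that the text does not spell out: node strengths are recomputed on the damaged graph, so that
links attached to removed nodes no longer carry traffic. The function name and the toy network are also
hypothetical.

```python
from collections import deque


def strength_integrity(edges, removed):
    """I_s = S_g / S_0 for an undirected weighted graph given as (i, j, w_ij) triples.

    Assumption: after the removal, strengths are recomputed on the surviving links only.
    """
    removed = set(removed)

    # Total strength S_0 of the undamaged network.
    nodes = set()
    total = 0.0
    for i, j, w in edges:
        nodes.update((i, j))
        total += 2.0 * w  # each link contributes to the strength of both end nodes

    # Damaged graph: surviving nodes, links and strengths.
    adj = {u: [] for u in nodes if u not in removed}
    s = {u: 0.0 for u in adj}
    for i, j, w in edges:
        if i in removed or j in removed:
            continue
        adj[i].append(j)
        adj[j].append(i)
        s[i] += w
        s[j] += w

    # Largest strength carried by any connected component H.
    best = 0.0
    seen = set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        carried = 0.0
        while queue:
            u = queue.popleft()
            carried += s[u]
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, carried)
    return best / total


# Invented example: a hub H with heavy links, plus two light peripheral links.
edges = [("H", "A", 10.0), ("H", "B", 10.0), ("H", "C", 10.0), ("A", "B", 1.0), ("C", "D", 1.0)]
print(strength_integrity(edges, []))
print(strength_integrity(edges, ["H"]))
```

In this toy case, removing the hub leaves a largest component containing 2 of the 5 original nodes,
while the traffic it carries is only a few percent of the original total.
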
Under random damage, all the integrity measures behave similarly to the simple topological case, i.e.
weighted networks are inherently resilient to random damage. This result is in agreement with the
theoretical prediction of the absence of a percolation threshold in highly heterogeneous graphs
|8()llfi5| , but it is not completely expected in weighted graphs. The scenario corresponding to the damage
of the most central nodes in the network is very different and depends on the strategy considered. We
have eliminated nodes (and corresponding links) according to their rank in terms of degree, strength,
outreach, distance strength, topological betweenness, and weighted betweenness. In Fig. 4.3 we report
the behavior of $N_g/N_0$ and of the outreach integrity $I_O(g)$ for all cases (the other integrities give
similar results). As expected, all strategies lead to a rapid breakdown of the network with a very
small fraction of removed nodes. The structural integrity of a network is principally due to strategic
points such as bridges and bottleneck structures, as evidenced by the fact that the size of the giant
component decreases faster using non-local measures of centrality (i.e. betweenness) instead of local
ones (i.e. degree, strength). In Ref. |144) , it was already shown that the betweenness is the most
effective quantity for pinpointing such nodes.

Figure 4.4: Geographical effect of the removal of nodes with largest strength. The integrity decreases
strongly in regions such as North America, while a "delay" is observed for the zones with smaller
initial outreach or strength.

However, the effects are reduced if we take weights into account: some of the important topological
bridges carry a small amount of traffic and are therefore part of more shortest paths than weighted
shortest paths. These bridges therefore have a lower rank according to the weighted betweenness.
Note that, among the local quantities, the distance strength is rather effective since it targets nodes
that connect very distant parts of the network.

The picture changes when the attention is shifted to the weighted integrity measures. In this case,
all strategies achieve the same level of damage, and the decrease is even faster than for topological
quantities. In other words, the functionality of the network can be temporarily jeopardized in terms
of traffic even if the physical structure is still globally well connected. This implies that weighted
networks appear more fragile than expected from topological properties alone. All targeted
strategies are very effective in dramatically damaging the network, reaching complete destruction
at a very small value of the fraction of removed nodes.

Finally, we consider the role of spatial constraints. As shown in Fig. 4.2, the various geographical
zones contain different numbers of central airports. The immediate consequence is that the different
strategies for node removal have different impacts in different geographical areas. Figure 4.4 displays
the case of a removal of nodes according to their strength (other removal strategies lead to similar
data), monitoring topological and outreach integrity. Each geographical zone is more sensitive to
a particular removal strategy, leading to the idea that large weighted networks can be composed of
different subgraphs with very different traffic structure and thus different responses to attacks.

Figure 4.5: Vulnerability of a network obtained using the growing model with spatial constraints
presented in Section 2.4.3, for two values of the parameter controlling the spatial constraints:
$\eta = 0.5$ (left) and $\eta = 0.005$ (right). The targeted damage is driven by removing nodes in order
of: degree (circles), strength (squares), and topological (crosses) and weighted (triangles) betweenness.
The behaviors of the topological ($N_g/N_0$) and functional ($I_s(g)$) integrities are shown.

4.2.3 Comparison with the spatial BBV model 

In Section 2.4.3 we have presented a model of growing weighted networks, proposed in Refs. |24l 125)
and based on preferential attachment. By means of a strength-driven preferential attachment
(i.e. a sort of "busy-get-busier" principle), the model generates networks with properties that are
observed in many real growing weighted networks, such as the small-world property and power-law
distributed degree, strength and weights. We have already observed that the BBV model pinpoints the
basic mechanisms governing the evolution of the airport network, even if it fails to reproduce the large
fluctuations observed between topological and weighted quantities (such as degree vs. strength or
topological betweenness vs. weighted betweenness).

A possible reason for the existence of these non-trivial correlations is the presence of spatial
constraints, i.e. the nodes of real networks are embedded in a two-dimensional space. The authors of
Ref. |24) have thus put forward a modified model, in which spatial constraints are taken into account
|27| . The spatial model shows more realistic correlations, but it does not explain all the features
observed in the real airport network.

In this section, we present some preliminary results on the vulnerability properties of
the model, pointing out the relation with the effects observed for the WAN. In particular, we look at
the role played by spatial constraints, fixing the parameter $\delta = 1$ and varying the typical length
scale of the connections, governed by the parameter $\eta$. The system is embedded in a square of linear
size $L = 1$, endowed with a Euclidean metric, so that small values of $\eta$ ($\eta \ll 1$) favor the
creation of "regional" structures with hubs of smaller connectivity than the global hubs obtained in the
original BBV model or in the present model when $\eta$ assumes larger values. We have considered the two
cases $\eta = 0.5$ and $\eta = 0.005$.

Figure 4.5 displays the behaviors of the topological and functional integrities under different damage
strategies. We have focused only on the main attack strategies, i.e. those driven by removing nodes
in order of degree, strength, betweenness and weighted betweenness. In this figure the weighted
betweenness has been computed using the weighted distance $\ell_{ij} = 1/w_{ij}$ instead of
$\ell_{ij} = d_{ij}/w_{ij}$, but the qualitative differences in the resulting vulnerability properties are
negligible (note that this is also true for the WAN). Comparing the two panels for $\eta = 0.5$ (left) and
$\eta = 0.005$ (right), it is clear that, as the spatial constraints increase, the network becomes more
vulnerable to betweenness-driven attacks. This is due to the fact that the system develops a structure
composed of many regional networks connected by a limited number of links. The links or the nodes that act
as bridges connecting different regional networks are very important for the global structural properties
of the system. They are the most central nodes in terms of both topological and weighted betweenness, so
that damaging the network by removing high-betweenness nodes reveals the fragility of such a regional
structure. This feature is in agreement with the behavior of $N_g/N_0$ in the real airport network.
In the model, the picture does not change if we look at the functional level, monitoring the decrease of
the functional integrity (in this case we have considered $I_s(g) = \mathcal{S}_g/\mathcal{S}_0$; the
other functional integrity measures show similar behaviors). The two non-local damage strategies are
indeed more effective in destroying the network functionality. This feature is different from what we
observe in the real airport network, which shows the same rapid decrease for all the strategies. The
difference is possibly due to the fact that, in the model, the relation between topology and traffic is
still too strong, while in the real WAN, other variables play an important role in separating the traffic
and economic dimension from the purely topological one.

In summary, the introduction of spatial constraints in a model of growing weighted networks produces
some structural properties that are in agreement with those observed in a real network such as the
WAN, but an improvement is required in order to pinpoint the mechanisms at the origin of those
functional properties of real weighted networks that are still not completely understood.

4.2.4 Conclusions 

In this study (see Ref. IHH]), we have identified a set of centrality measures for technological weighted
networks, in which two further ingredients, traffic and spatial constraints, are included. The main
achievements are summarized in the following points:

• the various definitions of centrality are correlated, but lead to different descriptions of the
importance of the nodes;

• the level of vulnerability of a network depends on the global property we have decided to monitor;
in general, the attack strategies based on the same property as (or on properties similar to)
the one that we are monitoring at a global scale are trivially more effective;

• spatial heterogeneity has to be considered, since weighted properties differ from region to region;

• weighted networks are, at a functional level, much more fragile than we would expect looking at
a purely topological level.

It is worth noting that, even if the vulnerability analysis seems to be completely static, the
previous strategies are based on a recursive recalculation of the centrality measures on the network after
each damage. As noted by Holme et al. |144| , each node removal leads to a change in the properties
of the other nodes. Only the neighboring nodes of the removed one suffer a change in degree and
strength, but the structure and number of shortest paths are modified for the whole graph and
therefore the betweennesses of all nodes may potentially be altered. It is therefore quite natural that,
after each node removal, the choice of the next discarded node should be made according to recalculated
degrees, strengths and betweennesses, and not according to the original ranking. Such a procedure is
somewhat akin to a cascade failure mechanism in which each failure triggers a redistribution on the
network and changes the next most vulnerable node.
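
The difference between an attack driven by the original ranking and one driven by a ranking recalculated
after each removal can be sketched as follows for the degree, the simplest case. This is an illustrative
toy with hypothetical function names and an invented seven-node graph; ties are broken by node label so
that the procedure is deterministic.

```python
from collections import deque


def giant_size(adj, alive):
    """Size of the largest connected component among the surviving nodes."""
    seen = set()
    best = 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def degree_attack(adj, n_remove, recalculate):
    """Remove n_remove nodes in decreasing order of degree.

    recalculate=False uses the ranking of the undamaged graph; recalculate=True
    recomputes the degrees on the damaged graph before every removal.
    Returns the size of the giant component after each removal.
    """
    alive = set(adj)
    static_order = sorted(adj, key=lambda u: (-len(adj[u]), u))
    sizes = []
    for step in range(n_remove):
        if recalculate:
            target = min(alive, key=lambda u: (-sum(v in alive for v in adj[u]), u))
        else:
            target = static_order[step]
        alive.remove(target)
        sizes.append(giant_size(adj, alive))
    return sizes


# Invented toy graph with two hubs (0 and 4) and a short chain (4-5-6).
adj = {0: {1, 2, 3}, 1: {0, 4}, 2: {0, 4}, 3: {0}, 4: {1, 2, 5}, 5: {4, 6}, 6: {5}}
print(degree_attack(adj, 3, recalculate=False))
print(degree_attack(adj, 3, recalculate=True))
```

In this small example the two sequences share their first two targets and differ only at the third
removal; nothing more general should be read into such a toy.
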

Actually, we have found [S^j that recalculated rankings do not differ considerably from the original
ones, even for non-local quantities like the betweenness. This result has two important consequences.
First, it supports the validity of using static analyses, such as that of vulnerability (and
indirectly percolation), to study dynamical properties of networks, such as spreading and cascade
failures or congestion. The other observation concerns the protection of large-scale infrastructures.
On the one hand, the planning of an effective targeted attack requires only information
on the initial state of the network. On the other hand, the identification of the crucial nodes to protect
is an easier task, since it is only weakly dependent on the attack sequence.
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4.3 Spreading processes on Weighted Random Graphs 

This second part of the chapter is devoted to the study of spreading phenomena on weighted networks.
According to the static analysis made in the previous section, weighted networks look much more
vulnerable at a functional level. In other words, the network's functionality suffers from the
disconnection of some high-traffic portions of the system. For the same reasons, we could expect to
observe the same features in the behavior of spreading processes on weighted networks. More precisely,
we want to model a process that exploits the traffic to spread throughout the network, with the
efficiency of the spreading depending on the properties of the weights.

In Section 4.1.2 we have explained the relation between spreading processes and percolation, stressing
that there is a correspondence between the existence of a giant component in the percolation process
and the possibility of a macroscopic spreading. The analogy becomes evident if we think
of simple examples, like the outbreak of an epidemic in a population of individuals, or the diffusion of a
virus on the Internet, but also the spreading of information or other physical quantities on a generic
network.

Nevertheless, the standard percolation cannot encode all the features of such spreading processes, in 
which the spreading rates depend on the properties of the nodes and the edges, and on the details of 
the process. For instance, in the Internet, the connections between computers are real cables, thus the 
transmission along them depends on their bandwidth; in the air-transportation network, the edges 
are weighted with the number of available seats, that is a measure of their capacity; in social and
contact networks as well, the importance of links is related to the type of relationships between the 
individuals. It is clear that real processes are influenced by these factors. 

Following this remark, we develop a general theory of spreading on networks, in which nodes and
edges are endowed with different spreading properties. In order to express the ability of an edge
to transfer some physical quantity (information, energy, diseases, etc.), we introduce an edge
transition probability $T_{ij}$, that depends on the properties of the vertices $i$, $j$ and of the edge itself.
In addition, in real processes the nodes have different abilities to spread, thus they should be supplied
with a node traversing probability $q_i$. In this manner, the actual size of the giant component can be
affected by the non-optimal or non-homogeneous flow through the nodes and edges, the most general
model being an inhomogeneous joint site-bond percolation. On the other hand, information about
simpler processes such as site percolation with inhomogeneous bonds (edges) or bond percolation with
inhomogeneous sites (nodes) can be easily obtained by assuming, in turn, $q_i$ or $T_{ij}$ to be uniform. In these
particular cases, we will refer to them as the node ($q$) or edge ($T$) occupation probability (in agreement
with standard percolation processes).
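
To make the joint site-bond process concrete, the following Python sketch draws one realization of it on a given edge list and measures the largest connected cluster with a union-find structure. It is only an illustration under simplifying assumptions that are not taken from the text: each edge is tested once with a symmetric probability (so $T_{ij} = T_{ji}$), and the function names and data layout are hypothetical.

```python
import random

def find(parent, x):
    """Root of x in the union-find forest (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def giant_component_fraction(n, edges, node_prob, edge_prob, rng):
    """One realization of joint site-bond percolation.

    edges     : list of (i, j) pairs
    node_prob : node_prob[i] plays the role of q_i
    edge_prob : edge_prob[(i, j)] plays the role of T_ij (assumed symmetric)
    Returns the size of the largest cluster of occupied nodes divided by n.
    """
    occupied = [rng.random() < node_prob[i] for i in range(n)]
    parent = list(range(n))
    size = [1] * n
    for (i, j) in edges:
        if not (occupied[i] and occupied[j]):
            continue  # one endpoint is not occupied
        if rng.random() >= edge_prob[(i, j)]:
            continue  # the edge is not traversed
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            if size[ri] < size[rj]:
                ri, rj = rj, ri
            parent[rj] = ri
            size[ri] += size[rj]
    best = max((size[find(parent, i)] for i in range(n) if occupied[i]), default=0)
    return best / n

def erdos_renyi_edges(n, mean_degree, rng):
    """Edge list of an Erdos-Renyi graph with the given average degree."""
    p = mean_degree / (n - 1)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]
```

Setting all `node_prob` entries to the same value $q$ gives site percolation with inhomogeneous bonds, while setting all `edge_prob` entries to the same value $T$ gives bond percolation with inhomogeneous sites.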

This formalism is well suited to the description of spreading phenomena on weighted networks, since we
can choose the transition probabilities to be functions of the weights ($T_{ij} = T(w_{ij})$). We can also define
a weighted giant component as the part of the giant component that is reached in the spreading process
and gives a measure of the functionality of the network. For instance, for the spreading of information,
it corresponds to the maximal subgraph that can bear and handle a large flow of information, ensuring
the global well-functioning of the network. In contrast, if we are dealing with epidemic spreading,
the weighted giant component corresponds to the maximal infected region reachable in the pandemic
regime. In both cases, the topological giant component is reduced by introducing weights.

For the sake of simplicity, the rest of this section is devoted to discussing spreading phenomena in weighted
random networks, and the possible applications in the description of real processes. This is done by fixing
$q_i = q$ and studying a site-percolation problem. On the other hand, the model of generalized spreading
phenomena on networks goes beyond the description in terms of weighted networks and can be analyzed
in a very general and abstract form. The general theory by means of which the most relevant
results have been derived is reported in Appendix B, while a brief introduction to generating
functions in graph theory is reported in Appendix A.

Figure 4.6: (A) Graphical representation of Eq. 4.6. Fat lines indicate the contribution coming
from the unknown probability $p_{ij}$ that an edge does not belong to the infinite cluster. The first
term on the right-hand side represents the contribution coming from the probability $1 - T_{ij}$ that a
flow does not traverse the edge; the second term accounts for the probability that the edge
is traversed times the unknown contribution $p_{jh}$ of other edges attached to $j$. (B) Diagrammatic
representation of the self-consistent Eq. 4.7. Fat lines correspond to the unknown uniform probability
$p(\{T\})$ that an edge does not lead to the giant component. Full lines mean that edges can be crossed,
with probability $T$, while the dashed edge corresponds to the contribution $1 - T$ that an edge cannot
be crossed.



4.3.1 Generalized Molloy-Reed Criterion in Weighted Random Graphs 

The idea of a percolation process in which the basic elements (nodes and edges) present inhomogeneous
properties goes back to the physical studies of conduction in resistor network models (see Ref. [156]
and references therein). A typical way to introduce disorder on the edges (or on the nodes) is that of
assigning to each of them a random number between 0 and 1, then removing all edges (nodes) in a
selected interval of values. Percolation properties are studied as functions of the fraction of remaining
edges (nodes) in the strong disorder regime [58, 164].

However, the percolation process we want to describe is more general than the theory of random
resistor networks. The latter, indeed, corresponds to the case in which the edge transition probability
is a step function (the Heaviside $\theta(T_{ij} - T_0)$ for a reasonable value $T_0$), so that transmission occurs only
along a fraction of the edges (with sufficiently high conductance) and, for those edges, it is optimal.
In other words, it corresponds to a bond percolation, as already noted in Ref. [188]. Here, we are
interested in a site percolation process in which, additionally, there is a quenched variable expressing
the probability of passing through an edge. This quenched probability is a function of the weights
along the edges, i.e. $T_{ij} = T(w_{ij})$.

We use an intuitive derivation of the percolation criterion, that generalizes the (Bethe approximation)
approach introduced by Cohen et al. in Ref. [80] for uncorrelated sparse random graphs. Let us call
$p_{ij}$ the probability that an edge in the network does not lead to a vertex connected via the remaining
edges to the giant component (infinite cluster) and $T_{ij}$ the probability that a flow leaving the node $i$
passes through the edge reaching the node $j$. Both $p_{ij}$ and $T_{ij}$ are defined in the interval $[0, 1]$.

In general, $p_{ij}$ depends on the transition probability of the edge through a term $1 - T_{ij}$,
and on the probability that the node $j$, if reached, does not belong to the infinite cluster. This
contribution can be computed as the joint probability that each one of the remaining edges emerging
from $j$ does not belong to the giant component. Assembling the different contributions, we get the
recursive expression (Fig. 4.6-A)

$$p_{ij} = 1 - T_{ij} + T_{ij} \prod_{h \in V(j),\, h \neq i} p_{jh} , \tag{4.6}$$

that contains a product of $k_j - 1$ terms $p_{jh}$ if $k_j$ is the degree of the node $j$ and $V(j)$ is its set of
neighbors. This recursive procedure is not closed, but the introduction of some hypotheses on the set
of transition probabilities $\{T\}$ allows a statistical reformulation of Eq. 4.6.

Let us suppose that the transition probabilities are random variables with a given distribution $p_T$, and
that the average overall probability $p$ that a randomly chosen edge does not belong to the giant component
depends only on some general properties of that distribution such as the mean, the variance, etc. (i.e.
$p = p(\{T\})$).

Note that in this section (as well as in Appendices A and B) we change notation with respect to Chapter 3,
in which the degree distribution is indicated as $P(k)$; here, we use instead $p_k$ for the degree distribution.
By picking up an edge at random in a random graph, the probability that it is connected to a node
of degree $k$ is $k p_k / \langle k \rangle$, where $p_k$ is the degree distribution of the graph. We can now write a
self-consistent equation for $p(\{T\})$ (Fig. 4.6-B),

$$p(\{T\}) = 1 - T + T \sum_k \frac{k p_k}{\langle k \rangle}\, p(\{T\})^{k-1} , \tag{4.7}$$

in which $T$ is a realization of the independent and identically distributed (i.i.d.) random variables $\{T\}$
in the interval $[0,1]$. Averaging both sides of Eq. 4.7 over $p_T$, we obtain that the probability
$p$ depends only on the average value $\langle T \rangle$, i.e. $p(\{T\}) = p(\langle T \rangle)$, and the equation becomes

$$p(\langle T \rangle) = 1 - \langle T \rangle + \langle T \rangle \sum_k \frac{k p_k}{\langle k \rangle}\, p(\langle T \rangle)^{k-1} \equiv I[p(\langle T \rangle)] . \tag{4.8}$$

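Apart from the trivial solution $p = 1$, another solution $p = p^* < 1$ exists if and only if $\frac{dI}{dp}\big|_{p=1} > 1$.
The curve $I[p]$, indeed, is positive in $p = 0$ ($I[0] = 1 - \langle T \rangle + \langle T \rangle p_1 / \langle k \rangle > 0$), therefore $\frac{dI}{dp}\big|_{p=1} > 1$
means that it crosses the line $f(p) = p$ in a point $0 < p^* < 1$. The condition on the derivative of the
r.h.s. of Eq. 4.8 corresponds to

$$\langle T \rangle\, \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle} > 1 . \tag{4.9}$$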

A generalization of the Molloy-Reed criterion [180] for the existence of a giant component in presence
of random weights on the edges immediately follows,

$$\frac{\langle k^2 \rangle}{\langle k \rangle} > 1 + \frac{1}{\langle T \rangle} , \tag{4.10}$$

meaning that, when the transition probabilities are i.i.d. random variables, a giant component exists
if and only if the inequality is satisfied. The case of uniform transition probabilities is exactly the
same, with the uniform value $T$ instead of the average value $\langle T \rangle$. Note that, while in the case of perfect
transmission ($T_{ij} = 1$) the usual formulation of the Molloy-Reed criterion is recovered [180], when
$\langle T \rangle < 1$ the r.h.s. of Eq. 4.10 can grow considerably, affecting the possibility of observing percolation:
the smaller the average transition probability, the larger the degree fluctuations $\langle k^2 \rangle$ needed to
ensure the presence of a giant component.
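
The criterion and the self-consistent equation for $p(\langle T \rangle)$ can be checked numerically. The sketch below is only an illustration: it assumes a Poisson degree distribution truncated at a hypothetical `k_max`, tests the inequality $\langle k^2 \rangle / \langle k \rangle > 1 + 1/\langle T \rangle$, and iterates the map $p \to I[p]$ starting from $p = 0$, which converges to the smallest fixed point.

```python
import math

def poisson_pk(mean_k, k_max=200):
    """Poisson degree distribution p_k for k = 0..k_max (the tail is neglected)."""
    pk = [math.exp(-mean_k)]
    for k in range(1, k_max + 1):
        pk.append(pk[-1] * mean_k / k)
    return pk

def molloy_reed_generalized(pk, mean_T):
    """True when <k^2>/<k> > 1 + 1/<T>."""
    k1 = sum(k * p for k, p in enumerate(pk))
    k2 = sum(k * k * p for k, p in enumerate(pk))
    return k2 / k1 > 1.0 + 1.0 / mean_T

def solve_p(pk, mean_T, iterations=10000, tol=1e-12):
    """Iterate p -> 1 - <T> + <T> sum_k (k p_k / <k>) p^(k-1), starting from p = 0."""
    k1 = sum(k * p for k, p in enumerate(pk))
    p = 0.0
    for _ in range(iterations):
        # the k = 0 term carries a factor k and is skipped explicitly
        branch = sum(k * w * p ** (k - 1) for k, w in enumerate(pk) if k >= 1) / k1
        new_p = 1.0 - mean_T + mean_T * branch
        if abs(new_p - p) < tol:
            p = new_p
            break
        p = new_p
    return p
```

When the inequality holds, `solve_p` returns a value $p^* < 1$; when it does not, the iteration approaches the trivial solution $p = 1$.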

For a random graph with Poissonian degree distribution (i.e. $p_k = e^{-\langle k \rangle} \langle k \rangle^k / k!$), the criterion in
Eq. 4.10 corresponds exactly to having $\langle k \rangle > 1/\langle T \rangle$. We show in Fig. 4.7-A the results of numerical
simulations for the giant component's computation on an Erdős-Rényi random graph with $N = 10^4$
nodes and with random transition probabilities uniformly distributed in $[a, b]$ with $0 < a < b < 1$.
A giant component clearly appears when the mean connectivity $\langle k \rangle$ exceeds the inverse of the mean
transmissibility value $\langle T \rangle = (a + b)/2$. Considering different distributions (e.g. binomial distributions) for
the i.i.d. random variable $T$ does not affect the results.

For heterogeneous graphs with broad degree distribution, the inequality in Eq. 4.10 is satisfied thanks
to the huge fluctuations of the node degrees, which ensure the l.h.s. to be larger than $1 + 1/\langle T \rangle$. In
particular, when the graph has a power-law degree distribution $p_k \sim k^{-\gamma}$ for $k \geq m$ ($2 < \gamma < 3$, $m$ is the
minimum degree), the fluctuations diverge in the limit $N \to \infty$. In this case, the giant component
always exists if $N$ is large enough. If $\gamma > 3$, the second moment is finite, and Eq. 4.10 provides the
following bound for the average transition probability necessary to have a giant component (computed
using the continuum approximation for the degree),

$$\langle T \rangle > \frac{\gamma - 3}{\gamma (m - 1) + 3 - 2m} . \tag{4.11}$$

The inequality is always satisfied for $m > 1 + 1/\langle T \rangle$, while for $m < 1 + 1/\langle T \rangle$ it is satisfied when
$\gamma < 3 + \frac{m \langle T \rangle}{1 - (m - 1) \langle T \rangle}$. For $m = 1$, an infinite (uncorrelated) scale-free graph presents a giant component
of order $O(N)$ only when $\gamma < 3 + \langle T \rangle < 4$. At $\gamma = 3$, logarithmic corrections should be taken into
account |8"H].
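
For completeness, the bound on $\langle T \rangle$ and the corresponding bound on $\gamma$ can be recovered in a few lines. The derivation below is a sketch that assumes the normalized continuum form $p_k = (\gamma - 1)\, m^{\gamma - 1} k^{-\gamma}$ on $[m, \infty)$, which is our reading of the continuum approximation mentioned in the text.

```latex
% Moments in the continuum approximation (gamma > 3)
\langle k \rangle = \frac{\gamma - 1}{\gamma - 2}\, m , \qquad
\langle k^2 \rangle = \frac{\gamma - 1}{\gamma - 3}\, m^2
\quad\Longrightarrow\quad
\frac{\langle k^2 \rangle}{\langle k \rangle} = \frac{(\gamma - 2)\, m}{\gamma - 3} .

% Insert into <k^2>/<k> > 1 + 1/<T>
\frac{(\gamma - 2)\, m - (\gamma - 3)}{\gamma - 3} > \frac{1}{\langle T \rangle}
\quad\Longrightarrow\quad
\langle T \rangle > \frac{\gamma - 3}{\gamma (m - 1) + 3 - 2m} .

% Solve for gamma, using gamma(m-1) + 3 - 2m = (gamma - 3)(m - 1) + m
(\gamma - 3) \left[ 1 - (m - 1) \langle T \rangle \right] < m \langle T \rangle
\quad\Longrightarrow\quad
\gamma < 3 + \frac{m \langle T \rangle}{1 - (m - 1) \langle T \rangle}
\quad \text{if } m < 1 + \frac{1}{\langle T \rangle} .
```

For $m \geq 1 + 1/\langle T \rangle$ the square bracket is not positive and the inequality holds for every $\gamma > 3$; for $m = 1$ the last line reduces to $\gamma < 3 + \langle T \rangle$.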

Actually, real networks are large but finite, and the second moment cannot diverge, thus we expect
that the condition is not always satisfied. This is clearly shown in Fig. 4.7-B, where we have reported
the size of the giant component for power-law random graphs as a function of the exponent $\gamma$ for
different values of the average edge transition probability $\langle T \rangle$. The size of the giant component is
dramatically affected by low transition probabilities. Now, the argument of Cohen et al. [80] can
be used to compute the threshold of site percolation with random disorder on the edges. If $q$ is the
(uniform) node occupation probability in an uncorrelated random graph, then a randomly chosen
node, whose natural degree in the case $q = 1$ is $k'$, will assume a degree $k \leq k'$ with probability

$$\binom{k'}{k} q^k (1 - q)^{k' - k} , \tag{4.12}$$

which means that the corresponding degree distribution $p_k(q)$ is related to the degree distribution
$p_{k'}(q = 1)$ by

$$p_k(q) = \sum_{k' = k}^{\infty} p_{k'}(q = 1) \binom{k'}{k} q^k (1 - q)^{k' - k} . \tag{4.13}$$

It follows that $\langle k \rangle_q = q \langle k \rangle_{q=1}$ and $\langle k^2 \rangle_q = q^2 \langle k^2 \rangle_{q=1} + q (1 - q) \langle k \rangle_{q=1}$, which, introduced into the expression
Eq. 4.10 for the Molloy-Reed criterion, give the expression of the threshold value for site percolation,

$$q_c = \frac{\langle k \rangle}{\langle T \rangle \left( \langle k^2 \rangle - \langle k \rangle \right)} , \tag{4.14}$$

in which the moments are those of the original graph ($q = 1$).

Figure 4.7: (A) Relative size of the giant component $Q_c$ vs. average degree $\langle k \rangle$ in an Erdős-Rényi
random graph in which the edge transition probabilities are uniformly distributed at random with mean
value $\langle T \rangle$. Simulations are performed on a sample of 100 graphs with $N = 10^4$ nodes. The different
curves refer to different values of the average transition probability: $\langle T \rangle = 1$ (circles), 0.75 (squares), 0.5
(up triangles) and 0.25 (down triangles). (B) Relative size of the giant component $Q_c$ vs. the exponent
$\gamma$ in a power-law graph in which the edge transition probabilities are uniformly distributed at random
with mean value $\langle T \rangle$. Simulations are performed on a sample of 100 graphs with $N = 10^4$ nodes, and
minimum degree $m = 1$. The different curves refer to different values of the average transition probability:
$\langle T \rangle = 1$ (circles), 0.75 (squares), 0.5 (up triangles) and 0.25 (down triangles).



Eq. 4.14 shows that, decreasing the average transmission capacity of the edges (i.e. the edge transition
probability), the value of the node occupation probability necessary to ensure percolation increases.
The expression of the percolation threshold in Eq. 4.14 is not very general, since it is correct only
for randomly distributed weights (and thus transition probabilities). On the contrary, in many real
networks (e.g. the WAN) the weights are correlated with the degree, $w_{ij} = w(k_i, k_j)$.
Consequently, a natural generalization is that of considering $T_{ij} = T(k_i, k_j)$. Of course, in this case the
previous derivation breaks down, since two types of correlations have to be considered: correlations
between edge transition probabilities and degree, and degree-degree correlations. The correct approach,
reported in Appendix B, is that of using the generating functions formalism in the framework of
Markovian correlated random graphs (i.e. with only degree-degree correlations) [43].
We consider here the simpler case in which the transition probability depends only on the degree of the
departure node (i.e. $T_{ij} = T_{k_i}$). The self-consistent Eq. 4.7 can be straightforwardly generalized
to consider the case $T = T_k$, obtaining the following expression for the percolation threshold,

$$q_c = \frac{\langle k \rangle}{\langle k^2 T_k \rangle - \langle k T_k \rangle} . \tag{4.15}$$

The striking property is that, for particular forms of the transition probability, a finite value for the 
percolation threshold can be restored also in infinite scale-free networks. We will investigate this case 
and other possible situations in the following section. 
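
The threshold of Eq. 4.15 is straightforward to evaluate for any tabulated degree distribution. The following Python sketch is an illustration only (the helper names and the truncation `k_max` are assumptions); with a constant $T_k = \langle T \rangle$ it reduces to the case of random, degree-independent transition probabilities.

```python
import math

def poisson_pk(mean_k, k_max=200):
    """Poisson degree distribution p_k for k = 0..k_max (the tail is neglected)."""
    pk = [math.exp(-mean_k)]
    for k in range(1, k_max + 1):
        pk.append(pk[-1] * mean_k / k)
    return pk

def site_threshold(pk, T_of_k):
    """q_c = <k> / (<k^2 T_k> - <k T_k>) for a distribution pk indexed by the degree k.

    Nodes of degree 0 do not contribute to any of the sums and are skipped.
    """
    terms = [(k, p) for k, p in enumerate(pk) if k >= 1]
    k1 = sum(k * p for k, p in terms)
    k2T = sum(k * k * T_of_k(k) * p for k, p in terms)
    k1T = sum(k * T_of_k(k) * p for k, p in terms)
    return k1 / (k2T - k1T)

# Poisson graph with <k> = 10 and degree-independent transition probabilities
pk = poisson_pk(10.0)
thresholds = {T: site_threshold(pk, lambda k, T=T: T) for T in (0.25, 0.5, 0.75, 1.0)}
```

For a Poisson distribution $\langle k^2 \rangle - \langle k \rangle = \langle k \rangle^2$, so that the values stored in `thresholds` are $1/(\langle T \rangle \langle k \rangle)$; a value larger than 1 returned by `site_threshold` means that no occupation probability is sufficient for percolation.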



4.3.2 Examples and Numerical Simulations 

The functional dependence of the transition probabilities $\{T\}$ is strongly related to the details of
the system and to the type of spreading process. Therefore, in the absence of hints from studies of
real data about spreading processes on networks, the present section is devoted to highlighting some
simple examples of reasonable functional forms for the transition probability, and to illustrating their
effects on the percolation condition. We perform our analysis on two main classes of random graphs:
homogeneous graphs, in which the connectivity distribution is peaked around a characteristic degree
value, and heterogeneous graphs, whose degree distribution is broad, with very large fluctuations in
the degree values.

Figure 4.8: Size $N_q/N$ of the giant component $Q_g$ as a function of the fraction $q$ of occupied vertices for
$\langle T \rangle = 0.25, 0.5, 0.75, 1$ for an Erdős-Rényi random graph (A) with $N = 10^5$ vertices and $\langle k \rangle = 10$ and
a power-law random graph (B) with the same size and exponent $\gamma = 2.3$. All curves have been averaged over
100 realizations. The predicted values $q_c = 0.4, 0.2, 0.133, 0.1$ are well verified in the Erdős-Rényi
graph by the values of $q$ at which $N_q \sim O(\log N)$. The power-law graph gives slightly worse results,
but still in agreement with the theoretical values $q_c = 0.387, 0.20, 0.19, 0.1$ (that are indicated by
arrows on the curves).

Random Transition Probabilities - Taking randomly distributed transition probabilities
$T_{ij}$ (and weights) can appear unrealistic in technological networks, but it may be interesting for modeling
spreading processes on some very disordered networks. We have checked the validity of Eq. 4.14 by
performing simulations on an Erdős-Rényi random graph with $\langle k \rangle = 10$ and on a power-law random
graph with exponent $\gamma = 2.3$. Both graphs have $N = 10^5$ nodes and the transition probabilities
are assigned randomly between 0 and 1. Fig. 4.8-A reports the size of the giant component of the
Erdős-Rényi graph as a function of the fraction $q$ of occupied vertices for $\langle T \rangle = 0.25, 0.5, 0.75, 1$; the
predicted values $q_c = 0.4, 0.2, 0.133, 0.1$ are well reproduced by simulations. The same measures for a
power-law graph are shown in Fig. 4.8-B. The curves of the simulations are in good agreement with
the theoretical values of the percolation threshold $q_c \simeq 0.387$ (for $\langle T \rangle = 0.25$), 0.20 (for $\langle T \rangle = 0.5$),
0.19 (for $\langle T \rangle = 0.75$), 0.1 (for $\langle T \rangle = 1.0$).¹

Single-vertex dependent transition probability - The assumption of transition probabilities
$T = T_k$ yields different results depending on the exact functional dependence on the degree of
the initial vertex. A first example is that of a monotonically increasing function of the degree, that

¹ The estimation of the threshold value, as explicitly shown in Ref. [84], is a very difficult task in power-law graphs,
given that the curves decrease with constant convexity, converging very smoothly to zero. Indeed, the
probability of nodes with maximum degree is $O(1/N)$, thus in a finite graph their frequency depends on the size of the
system. Since the flows are clearly unbalanced in favor of large degree nodes, finite size effects are more relevant in
these types of processes.










Figure 4.9: (A) Critical density $q_c$ of nodes ensuring the existence of a giant component $Q_g$ as a
function of the exponent $\gamma$ for a power-law graph in which the transition probability function is
single-vertex dependent with the form $T_k \simeq 1 - \exp(-a k / \langle k \rangle)$. The figure reports the theoretical curve.
The parameter $a$ controls the saturation to 1 for highly connected nodes: the cases $a = 0.2$ (circles)
and $a = 1$ (squares) are shown. For large values of the parameter $a$, the curves rapidly converge to
the limit behavior of homogeneous percolation (limit $a \to \infty$). For $a < 1$, the curves shift from this
limit, but in the range $2 < \gamma < 3$ diverging fluctuations ensure the percolation. (B)-(C) The curves
$N_q/N$ obtained by numerical simulations on power-law random graphs of size $N = 10^5$ and exponent
$\gamma = 2.4$ and $\gamma = 2.8$ for the two values of $a = 0.2, 1$. The arrows indicate the positions of the threshold
predicted using the theoretical calculation of Ref. [El].

saturates to 1 at large $k$ values: we consider $T_k \simeq 1 - \exp(-a k / \langle k \rangle)$, that can be applied also to
infinite graphs. The parameter $a$ controls the convergence to 1. For large scale-free networks, however,
the mere presence of a saturation, rather than its rapidity, is sufficient for the system to behave as in
standard percolation, showing that the graph's heterogeneity ensures a zero percolation threshold. In
general, from Eq. 4.15 simple calculations lead to the expression of the threshold for site percolation
(see Ref. [Hi]). Its behavior is sketched in Fig. 4.9-A as a function of $\gamma$ and for some values of the
parameter $a$. It shows that the critical fraction of occupied nodes $q_c$ is exactly 0 for $2 < \gamma < 3$. Also
for $\gamma > 3$, the behavior is qualitatively the same as for standard percolation on scale-free networks.
The reason is essentially that a transition probability converging to 1 for highly connected nodes does
not affect the network's properties if the network is infinite, because there is always a considerable
fraction of nodes with optimal transmission.

This condition is not trivially satisfied by finite graphs, and the presence of a cut-off in the degree
can have a strong influence on the network's functional robustness. Two kinds of cut-offs on power
laws have been largely studied in the literature: an abrupt truncation of the degree distribution at a
maximum value that scales with the system size $N$, or a natural exponential cut-off, i.e. $p_k \sim k^{-\gamma} e^{-k/\kappa}$.
Without entering into the details of the analytic and numerical investigation, the main result is that
the introduction of a cut-off actually reduces the effects of degree fluctuations, producing a finite
percolation threshold also in the range $2 < \gamma < 3$.

However, this choice of edge inhomogeneity does not seem to enrich the scenario obtained by standard
percolation, though the case of slow convergence ($a < 1$) produces non-negligible effects on finite
systems (panels B and C in Fig. 4.9). The structural properties of homogeneous graphs, on the contrary,
are always deeply affected by this type of transition probability, especially when perfect transmission
is reached beyond the peak of the degree distribution ($a < 1$) [EI]. Another wide class of transition
probabilities contains those converging to zero with increasing degree. At first glance, it seems a
very unphysical condition, but the problem can be inverted by asking which is the maximal decay in the
degree dependence still ensuring site percolation.

Figure 4.10: (A) Fraction $q_c$ of occupied nodes required to have a giant component as a function of
the exponent $\gamma$ for a power-law graph in which the transition probability is the single-vertex function
$T_k \sim k^{-\alpha}$, with exponent $0 < \alpha < 1$. The curves report the values of $q_c$ computed by Eq. 4.16. All
curves for $\alpha = 0.25$ (circles), 0.5 (squares) and 0.75 (triangles) show that the percolation threshold is
increased compared to the homogeneous result (dashed line), also for an infinite graph. In particular, a
finite threshold larger than 0 appears also in the range $2 < \gamma < 3$. (B) Threshold value $q_c$ of occupied
nodes as a function of the parameter $\alpha$ for power-law graphs with cut-off $\kappa = 10^2$ and exponent
$\gamma = 2.3$ (full line and circles) and $\gamma = 2.6$ (dashed line and crosses). The symbols (circles and crosses)
are results from simulations on random graphs of size $N = 10^5$ (average over 100 realizations); the
lines are the corresponding theoretical predictions.

For a single-vertex transition probability $T_k$, Eq. 4.15 implies that, independently of the degree
distribution, a graph does not admit percolation if $T_k$ decays as $T_1 / k^{\alpha}$ with $\alpha > 1$. Indeed, by substitution,
$q_c = \langle k \rangle / [T_1 (\langle k^{2 - \alpha} \rangle - \langle k^{1 - \alpha} \rangle)]$, that is larger than 1 because $\langle k \rangle > \langle k^{2 - \alpha} \rangle > 0$ and $T_1 < 1$
(assuming $\langle k \rangle > 1$). This is actually equivalent to the result presented in Ref. [188].
Since $\alpha = 0$ corresponds to standard percolation, we are interested in the intermediate case $0 < \alpha < 1$.
In scale-free networks the percolation threshold is computed by series summation,

$$q_c = \frac{1}{T_1}\, \frac{\zeta(\gamma - 1)}{\zeta(\gamma + \alpha - 2) - \zeta(\gamma + \alpha - 1)} , \tag{4.16}$$

that has been plotted in Fig. 4.10-A as a function of $\gamma$ for different values of $\alpha$ between 0 and 1. The
curves show that a transition probability moderately decreasing with the degree can induce power-law
graphs with $2 < \gamma < 3$ to present a finite percolation threshold. However, many real networks
(Internet, WWW, etc.) have a power-law degree distribution with exponent lower than 2.5, for which
the giant component persists even for relatively large $\alpha$ (at least in the approximation of infinite
systems). Nevertheless, the giant component is reduced in finite systems or in the presence of a cut-off
on the degree, as shown by the results of simulations reported in Fig. 4.10-B. The figure displays
the behavior of the threshold value $q_c$ as a function of the parameter $\alpha$ for two different exponents
$\gamma = 2.3, 2.6$. The lines are the theoretical predictions for (infinite) systems with cut-off $\kappa = 10^2$, the
points are numerical results on finite systems with $N = 10^5$ and the same cut-off. The threshold values $q_c$
for finite sizes are reasonably smaller compared to the theoretical predictions for infinite graphs with
the same cut-off on the degree distribution.
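
The dependence of the threshold on the decay exponent $\alpha$ in the presence of a cut-off can be explored with a direct summation. The sketch below is illustrative and rests on assumptions that are not specified in the text: the cut-off is implemented as an abrupt truncation at `k_cut`, the degree distribution is $p_k \propto k^{-\gamma}$ on the integers from `k_min` to `k_cut`, and $T_k = T_1 k^{-\alpha}$. The normalization of $p_k$ cancels in the ratio.

```python
def powerlaw_threshold(gamma, alpha, k_min=1, k_cut=100, T1=1.0):
    """q_c = <k> / (T1 (<k^(2-alpha)> - <k^(1-alpha)>)) for a truncated power law."""
    degrees = range(k_min, k_cut + 1)
    first_moment = sum(k ** (1.0 - gamma) for k in degrees)
    weighted = sum(k ** (2.0 - alpha - gamma) - k ** (1.0 - alpha - gamma) for k in degrees)
    return first_moment / (T1 * weighted)

# threshold as a function of alpha for the two exponents considered in the text
curve_23 = [(a / 10.0, powerlaw_threshold(2.3, a / 10.0)) for a in range(0, 11)]
curve_26 = [(a / 10.0, powerlaw_threshold(2.6, a / 10.0)) for a in range(0, 11)]
```

Letting `k_cut` grow, the sums approach the zeta functions of Eq. 4.16 whenever these converge; a returned value larger than 1 signals that percolation is not possible for that choice of parameters.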

4.3.3 Discussion and Conclusions 

This collection of examples is far from being complete; it should rather be seen as a way to better
understand potential applications of the calculations and formulae we have presented. Among other
possible expressions for the transition probability, we mention non-monotonic or peaked degree-dependent
transition probabilities, with an optimum of transmissibility for (not necessarily large)
characteristic values of the degree. The actual spreading of a virus on the Internet might have such
a behavior. Highly connected nodes have large exchanges of data with their neighbors, so they
should potentially be the nodes with the highest transmission capability. On the other hand, there is a
common awareness of their importance, thus they are better protected and controlled, so that the
effective transition probability for the spreading of viruses from these nodes is strongly reduced. Very
low-degree nodes are less protected but also less exposed to the transmission (their data exchanges
are limited). Such a scenario suggests that non-monotonic transition probabilities are also
interesting in the study of dynamical processes on networks.

In conclusion, this section has been devoted to the definition of a generalized percolation with edge 
transition probabilities that can be used to describe spreading processes. In the case of weighted 
networks, the edge transition probabilities should depend on the weights. We have obtained two 
important results: 

• When weights and degree are not correlated, it is reasonable to assume that also the edge
transition probabilities are randomly distributed (not strictly uniformly random). It follows
that when the percolation threshold is present in the topological description, it is increased; but
it cannot be restored when it is topologically absent.

• Real systems present correlations between weights and degree, thus it is reasonable to include 
correlations also in the expression for the edge transition probability. In this case, it is possible 
to restore a finite percolation threshold, even when it is topologically absent. 



Chapter 5 



Naming Game: a model of social 
dynamics on networks 

5.1 Introduction 

In Chapters 3 and 4 we have studied the structural and functional properties of complex networks, supporting
the phenomenological analysis of real infrastructure networks with a numerical and analytical
investigation of theoretical models. In such a framework, dynamical processes on networks, like exploration
and spreading models, have been principally conceived as tools, by means of which a better
characterization of the underlying network properties can be achieved.

We turn now our attention to a more direct investigation of dynamical phenomena on complex net- 
works, focusing on the field of social networks, and in particular on models of social interactions 
between individuals. 

A theoretical description of social interactions at the individual level is a very complicated task, that 
is studied by sociologists and goes beyond the purpose of any statistical physics approach. Though it 
is not possible to reproduce or predict human decisions and actions, when looking at a global scale, 
social interactions are much simpler to describe. In this case, we do not need to understand all details 
of single individual behaviors, but only few collective aspects and their relation with the other global 
properties of the population under study. In particular, the large scale observation of social interac- 
tions reveals the existence of striking collective phenomena, i.e. the mechanism, based on individual 
decision processes, leading to the appearance of an overall homogeneous behavior or some other global 
property in a population of agents. 

The standard methods of statistical physics are very appropriate to study such collective behaviors,
neglecting details and retaining only a few general ingredients observed in real social interactions. For
this reason, physicists have put forward a large number of theoretical models of social dynamics, borrowing
a suite of statistical methods from the theory of interacting particle systems |172M9fil|lllll42|.
The largest part of these models deals with the study of opinion formation and strategic games. For
instance, the Ising majority rule has found an unexpected field of application in problems of opinion
formation [75, 179, 125, 126]. On the same subject, many other models have been proposed, such as
the Voter model [H2 HE3 EH ED E23 H22, the Sznajd-Weron model [235], the Axelrod model for
the dissemination of culture ^3|, and their variants (e.g. the models proposed by Deffuant et al. [94]
and by Krause and Hegselmann [142]). Other types of social dynamics involve diffusion or spreading
processes (e.g. the diffusion of knowledge [SI], rumors [S3], innovations [214] or ideas [37], etc.), and
the wide field of strategic games (e.g. the minority game [73], the prisoner's dilemma [209], etc.).
The behavior of these models was usually studied on regular topologies, even though these are clearly
inappropriate for representing the topology of social interactions. For this reason, for a long time models
of social dynamics have been regarded more as academic exercises than as systems with real potential
applications. The growing interest in network science has recently led to a better knowledge of the
topological properties of real social groups (Section 2.3), and to the formulation of network models
reproducing such properties (Section 2.4). Consequently, many traditional models of social dynamics
have been reconsidered, in order to be studied in the new, more appropriate, framework of complex
networks.

In this chapter, we report the main results of a series of publications 20, 90, 1?T1 19^1 121) . in which we 
have studied a recently proposed model of social dynamics, called the Naming Game, investigating its
dynamical behavior on both regular topologies and complex networks. As we will see in the next section,
this model is conceived to capture the self-organized mechanisms leading to the onset of a communication
system in groups of individuals and, more generally, to describe decision processes involving pairwise
interactions and negotiation, like those underlying opinion spreading. In addition, with respect
to other models of social dynamics, the evolution rules of the Naming Game present new ingredients,
whose role will be elucidated throughout the chapter. They can be summarized in the following points:

• the introduction of memory in the individual dynamics; 

• the presence of a feedback interaction with an asymmetric interaction rule; 

• an a priori unlimited number of states (or words, opinions, etc) that is dynamically determined 
during the system's evolution. 

The resulting dynamics is rich in interesting properties, some of which depend strongly on the
underlying topological properties of the system. Hence, the main motivation of this work is to
study the impact of different topological properties on the local and global dynamical patterns
generated by the Naming Game and on the process leading to the emergence of collective phenomena.

The present chapter is organized as follows. Section 5.2 contains the definition of the model and
the analysis of the mean-field case. A topology in which agents are allowed to interact with all the
others is not realistic; social environments are, on the contrary, more reliably modeled by means of
networked structures, which can better account for the disparity of social relations. Thus, in Section 5.3
we provide an exhaustive analysis of the Naming Game dynamics on various network models. We
start with low-dimensional lattices and small-world networks, on which some analytical results are
available. Then we turn our attention to general complex networks, looking in particular at the effects
of node heterogeneity. The last part of Section 5.3 is devoted to the study of real complex networks,
whose dynamics is often surprisingly different from that observed on computer-generated networks.
The internal activity of the agents governing the decision process is described in Section 5.4, while
the conclusions are presented in Section 5.5. We refer to Appendices C and D for the analysis of some
more technical issues that are omitted from the main text.

5.2 Naming Game: General features

Social interactions are based on the existence of a communication system among the agents, who are 
able to understand each other by means of common linguistic patterns or, more generally, by means 
of a common vocabulary of symbols. 

Such a communication system is the result of a self-organized process in which individuals select
specific symbols (words) and associate them with concepts and ideas (objects). The emergence of a shared
lexicon inside social groups and communities of people is very likely to be driven by simple criteria,
like popularity, imitation, negotiation, and agreement. When a new concept is introduced, people
refer to it using several different names or words. These words start spreading among the population,
competing with one another, until one of them is selected (with a sudden transition
or through a long process) and everyone uses the same word (or symbol, etc.) (1661 1561 1145|. This kind of
dynamics has recently become of broad interest after the diffusion of a new generation of web tools
which enable human users to self-organize a system of tags in such a way as to ensure a shared
classification of information about different topics (see, for instance, del.icio.us or www.flickr.com and
Refs. |72l I133p). Another application concerns global coordination problems in artificial intelligence,
where a group of artificial embodied agents moving in an unknown environment have to exchange
information about the objects they gradually discover. The emergence of consensus about the objects'
names allows a communication system to be established. A practical example of this type of dynamics is
provided by the well-known Talking Heads experiment 23U1 I231], in which embodied software agents
develop their vocabulary by observing objects through digital cameras, assigning them randomly chosen
names and negotiating these names with other agents.

In other words, the process leading to the emergence of a communication system in a population
of agents (e.g. a social network) is an example of a social collective phenomenon with polarization of
individual opinions or ideas. On the basis of these observations, a new field of research called Semiotic
Dynamics has been developed [232], which investigates by means of simple models how (linguistic)
conventions originate, spread and evolve over time in a population of agents endowed with simple internal
states and local pairwise negotiation interactions.

The fundamental model of Semiotic Dynamics is the so-called Naming Game [229], in which a
population of agents, interacting by pairwise negotiation rules, tries to assign a common name to an object.
Moreover, this model can be studied as an alternative model of opinion formation, since in place of
names to be assigned to an object we can think of different opinions competing on a given
topic.

The next section is devoted to the description of the Naming Game model, while the subsequent one 
discusses the mean-field case. 

5.2.1 The Model 

Figure 5.1: Agent interaction rules. Each agent is described by its inventory, i.e. the repertoire of
known words. The speaker picks at random a name from its inventory and transmits it to the hearer.
If the hearer does not know the selected word, the interaction is a failure (top) and the hearer adds the
new name to its inventory. Otherwise (bottom), the interaction is a success and both agents delete all
their words but the winning one. Note that if the speaker has an empty inventory (as happens at
the beginning of the game), it invents a new name and the interaction is a failure.

A minimal model of the Naming Game has been put forward by Baronchelli et al. in Ref. [22] to reproduce
the main features of Semiotic Dynamics and the fundamental results of adaptive coordination observed
in the Talking Heads experiment. The minimal Naming Game model consists of a population of N
agents observing a single object, for which they invent names that they try to communicate to one
another through pairwise interactions, in order to reach a global agreement. The agents are identical
and each has an internal inventory, in which it can store an a priori unlimited number of names
(or opinions). All agents start with empty inventories. At each time step, a pair of neighboring agents
is chosen randomly, one playing as "speaker", the other as "hearer", and they negotiate according to the
following rules (see also Fig. 5.1):

• the speaker selects randomly one of its words and conveys it to the hearer; 

• if the hearer's inventory contains such a word, the two agents update their inventories in order 
to keep only the word involved in the interaction (success); 

• if the hearer does not possess the uttered word, the latter is added to those already stored in 
the hearer's inventory (failure), i.e. it learns the word. 
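
The interaction rules above can be summarized in a short simulation sketch. The Python code below is only illustrative: the function names `naming_game_step` and `run_to_consensus`, the use of integers as words, and the choice of drawing speaker and hearer uniformly among all agents (i.e. the mean-field case) are assumptions made here for the sake of the example, not details taken from Ref. [22].

```python
import random


def naming_game_step(inventories, rng, next_word):
    """Perform one speaker-hearer interaction on a complete graph.

    Returns (success, next_word), where next_word is the next unused name.
    """
    speaker, hearer = rng.sample(range(len(inventories)), 2)
    if not inventories[speaker]:
        # A speaker with an empty inventory invents a brand-new name.
        inventories[speaker].add(next_word)
        next_word += 1
    # The speaker selects one of its words at random (sorted for reproducibility).
    word = rng.choice(sorted(inventories[speaker]))
    if word in inventories[hearer]:
        # Success: both agents keep only the word involved in the interaction.
        inventories[speaker] = {word}
        inventories[hearer] = {word}
        return True, next_word
    # Failure: the hearer learns the word.
    inventories[hearer].add(word)
    return False, next_word


def run_to_consensus(n_agents, seed=0, max_steps=10**7):
    """Iterate the dynamics until all agents share the same unique word."""
    rng = random.Random(seed)
    inventories = [set() for _ in range(n_agents)]
    next_word = 0
    for step in range(1, max_steps + 1):
        _, next_word = naming_game_step(inventories, rng, next_word)
        if all(len(inv) == 1 for inv in inventories) and len(set.union(*inventories)) == 1:
            return step, inventories
    raise RuntimeError("no consensus reached within max_steps")
```

For instance, `run_to_consensus(100)` returns the number of pairwise interactions performed before consensus together with the final inventories; an average over many seeds would be needed before comparing with any scaling law.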

Note that the time unit corresponds to the pairwise interaction, in contrast to usual statistical me- 
chanics models in which it corresponds to N interactions. However, in order to compare the results 
for the present dynamical rule with well-known results of other models, in many cases, we will use a 
rescaled time t/N. 

Before entering into the detailed description of the dynamics, it is worth pointing out some evident
differences between the Naming Game and other commonly studied models of social dynamics and, in
particular, of opinion formation [221 IP1 PHA l?5l IT791 IT72].

• Each agent can potentially be in an infinite number of possible discrete states (words, names, 
opinions), and the maximum number of states depends on the dynamical evolution itself. This is 
in contrast with traditional models (Voter, Potts, etc) in which the number of states is a fixed 
external parameter taking finite (and usually small) values. 

• The two-step decision process is realistic: an agent can accumulate in its memory different
possible names for the object, waiting before reaching a decision. This feature is probably at
the origin of the surface tension that emerges from the dynamics in low-dimensional lattices (see
Section 5.3) and of the non-Poissonian behavior at the level of single agents, a property that will be
discussed in Section 5.4.

• Each dynamical step can be seen as a negotiation between speaker and hearer, with a certain
degree of stochasticity, which is absent in deterministic models such as the Voter model. The
stochastic component is however of a different nature compared to that of the standard Glauber
dynamics used in majority rule models [129], since here it comes from an internal selection
criterion and involves only the speaker, without affecting the (deterministic) decision process
of the hearer.

A second important remark concerns the random extraction of the word from the speaker's inventory.
Most previously proposed models of semiotic dynamics attempted to give a more detailed representation
of the negotiation interaction by assigning weights to the words in the inventories. In such models,
the word with the largest weight is automatically chosen by the speaker and communicated to the hearer.
Successes and failures are translated into updates of the weights: the weight of a word involved in a
successful interaction is increased to the detriment of those of the others (with no deletion of words);
a failure leads to a decrease of the weight of the word not understood by the hearer. An example of
a model including weight dynamics can be found in Ref. [170] (and references therein). For the sake
of simplicity, the minimal Naming Game avoids the use of weights: they are apparently more realistic,
but their presence is not essential for the emergence of a global collective behavior of the system.

An important point needs to be stressed: while in the original experiments the embodied agents could
observe a set of different objects, in the minimal Naming Game all agents refer to the same single
object. This is actually possible only if we assume that homonymy is excluded, i.e. two distinct objects
cannot have the same name. Consequently, in this model, all objects are independent and the general
problem reduces to a set of independently evolving systems, each one described by the minimal model.
In more realistic situations, however, the occurrence of homonymy cannot be neglected.

5.2.2 Mean-field case 

In the original work on the minimal Naming Game model [22], Baronchelli et al. focused on the
mean-field case, in which the agents are placed on the vertices of a complete graph; this corresponds to
studying a population in which all pairwise interactions are allowed. By means of numerical simulations,
they investigated the overall dynamics of the model, monitoring three main global
quantities along the evolution:

• the total number $N_w(t)$ of words in the system at time $t$ (i.e. the total size of the memory);

• the number of different words $N_d(t)$ in the system at time $t$;

• the average success rate $S(t)$ as a function of time, i.e. the probability, computed by averaging
over many simulation runs, that the chosen agent gets involved in a successful interaction at a
given time $t$.

Note that all these quantities are zero at the beginning of the evolution, when all inventories are empty,
while they reach a stable value when the system enters the absorbing state corresponding to a global
agreement. In fact, it is possible to show, using a sort of Lyapunov functional, that the mean-field
model always converges to the absorbing state (consensus) [22]. The consensus state is identified by
$N_d = 1$ and $N_w = N$ (moreover, it implies $S = 1$).
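
As an illustration of how the first two quantities and the consensus condition can be measured on a single configuration, consider the following minimal sketch. It assumes, purely for the example, that the state of the system is stored as a list of Python sets (one inventory per agent); the function name `observables` is hypothetical.

```python
def observables(inventories):
    """Return (N_w, N_d, consensus) for a list of inventories (sets of words)."""
    total_words = sum(len(inv) for inv in inventories)    # N_w: total memory used
    different_words = len(set().union(*inventories))      # N_d: distinct words
    # Consensus: a single word survives and every agent stores exactly that word.
    consensus = different_words == 1 and total_words == len(inventories)
    return total_words, different_words, consensus
```

The success rate $S(t)$ is not a property of a single configuration: it has to be estimated by recording the outcome of the interactions over many independent runs.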

Figure 5.2: Evolution of the total number of words $N_w$ (top), of the number of different words $N_d$
(center), and of the average success rate $S(t)$ (bottom), for a mean-field system (black circles) and
low-dimensional lattices (1D, red squares, and 2D, blue triangles) with $N = 1024$ agents, averaged
over $10^3$ realizations. The inset in the top graph shows the very slow convergence in low-dimensional
lattices.

The temporal evolution of the three main quantities is depicted in Fig. 5.2 (circles). At the beginning,
many disjoint pairs of agents interact, with empty initial inventories: they invent a large number
of different words, which start spreading throughout the system through failure events. Indeed, the
number of words decreases only by means of successful interactions. In the early stages of the dynamics,
the overlap between the inventories is very low, and successful interactions are limited to those pairs
which have been chosen at least twice. Since the number of possible partners of an agent is of order
$N$, an agent rarely interacts twice with the same partner, the probability of such an event growing as $t/N^2$.
Note that this remark is in good agreement with the initial behavior of the success rate $S(t)$ depicted
in Fig. 5.2. The initial trend of $S(t)$ (black circles) is linear with a slope of order $1/N^2$ (whose precise
value has been computed in Ref. [22]).

In this phase of uncorrelated proliferation of words, the number of different words $N_d$ invented by
the agents grows, rapidly reaching a maximum that scales as $O(N)$. Then $N_d$ saturates, displaying
a plateau, during which no new word is invented (since every inventory contains at least
one word). The total number $N_w$ of words stored in the system has a similar behavior, but it keeps
growing after $N_d$ has saturated, since the words continue to propagate throughout the system even if
no new one is introduced.

The peak of $N_w$ has been shown to scale as $O(N^{1.5})$ [22], meaning that each agent stores $O(N^{0.5})$
words. This peak occurs after the system has evolved for a time $t_{max} \sim O(N^{1.5})$. In the subsequent
dynamics, strong correlations between words and agents develop, driving the system to a rather fast
convergence to the absorbing state in a time $t_{conv} \sim O(N^{1.5})$.

The S-shaped curve of the success rate in Fig. 5.2 summarizes the dynamics: initially, agents hardly
understand each other ($S(t)$ is very low); then the inventories start to present significant overlaps, so
that $S(t)$ increases until it reaches 1, and the communication system is completely established. Despite the
apparent simplicity of the dynamics, it is very difficult to study this model analytically, because the
interaction rule is not totally stochastic (the hearer's moves are deterministic) and the update of the
inventories after a success is highly non-linear. Nevertheless, the authors of Ref. [22] provided several
qualitative theoretical arguments in order to explain the main properties of the population's global
behavior. We consider here only the argument for the scaling behavior of the maximum number of
words in the system.

Figure 5.3: Graphical representation of the four possible terms contributing to the rate equation for
the number of undecided agents, and the order of magnitude of each term. Note that, around $t_{max}$,
the fraction $n_u(t)/N \sim 1$, and then it decreases to zero; therefore, in principle, all terms play a role
during the evolution.

From the simulations, we know that the maximum number of words scales as a power of the population
size $N$; let us therefore assume that, at the maximum, the number of words per agent scales as $cN^{\alpha}$,
for a certain positive value of the exponent $\alpha$ and a constant $c \sim O(1)$. Note that $\alpha < 1$, since the
number of different words is at most $O(N)$.

In order to determine the value of $\alpha$, we use the fact that, at the maximum, the first temporal
derivative of the total number of words vanishes, i.e. $dN_w(t)/dt = 0$. On the other hand, the probability
that a word chosen by the speaker is present in the inventory of the hearer is approximately $2cN^{\alpha}/N$ in
absence of correlations. We get the following rate equation [22]:

\[
\frac{dN_w(t)}{dt} \propto \frac{1}{cN^{\alpha}} \left( 1 - \frac{2cN^{\alpha}}{N} \right) - \frac{1}{cN^{\alpha}} \, \frac{2cN^{\alpha}}{N} \, 2cN^{\alpha} . \qquad (5.1)
\]

The first term on the r.h.s. of Eq. 5.1 is the gain term (in case of a failure), while the second term
represents the loss of $2cN^{\alpha}$ words in case of a successful interaction.

At the leading order in powers of $N$, i.e. neglecting terms that decay as $N^{\alpha - 1}$, the balance condition
$dN_w(t)/dt = 0$ is satisfied if $\alpha = 1/2$. This result shows that the maximum number of words in the
system scales as $N^{3/2}$. From the same argument we can also obtain the scaling of the time at which
the peak occurs, $t_{max} \sim N^{3/2}$.
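
For the reader's convenience, the balance argument can be made explicit in a few lines. The sketch below only rearranges the condition $dN_w(t)/dt = 0$, under the assumptions stated in the text (inventories of size $cN^{\alpha}$ and a success probability of about $2cN^{\alpha}/N$); it is a scaling estimate, not a rigorous derivation.

```latex
% Balance condition at the peak; the common factor 1/(cN^\alpha) has been dropped:
1 - \frac{2cN^{\alpha}}{N} = \frac{2cN^{\alpha}}{N}\, 2cN^{\alpha}
\quad\Longrightarrow\quad
1 - 2c\,N^{\alpha-1} = 4c^{2}\,N^{2\alpha-1} .
% Since \alpha < 1, the term N^{\alpha-1} vanishes for large N,
% so the right-hand side must remain of order one:
2\alpha - 1 = 0 \quad\Longrightarrow\quad \alpha = \tfrac{1}{2},
\qquad
N_w^{max} \sim N \cdot cN^{1/2} \sim N^{3/2} .
```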

The study of the convergence time is much more difficult, and no theoretical arguments have been
proposed yet (except for some results on the limit of infinite-dimensional lattices reported in
Section 5.3.1). It is however possible to write down a very naive argument for the scaling (with the size
$N$) of the convergence time, i.e. $t_{conv} \sim O(N^{3/2})$. This argument is not meant to be rigorous;
its importance lies mainly in showing which kind of processes lead the system to a consensus
state, justifying the observed behaviors.

Figure 5.4: A naive representation (left) of the mean-field approximation of pairwise interactions:
we focus on the speaker's behavior, replacing the dynamics of the single hearer with a mean-field average
quantity. Therefore, only the speaker's contributions to the rate equation for $n_u(t)$ have to be considered,
and these are provided by only two possible interactions (right). We can neglect the first term, in which the
speaker has only one word, because it does not change the number of undecided nodes. The second
term, which decreases $n_u(t)$ by one unit, is the only contribution to Eq. 5.2 in this rough approximation.

Let us consider the number $n_u(t)$ of agents with more than one word in the inventory (we call
them undecided agents): around the peak in the number of words $n_u \sim N$, while at convergence
$n_u \sim 0$. We can now study the way the system approaches consensus by writing a rate equation for
this quantity.

According to the actual pairwise interaction rule of the Naming Game, there are many different
contributions that should be taken into account (see Fig. 5.3). Moreover, the negotiation process introduces
correlations related to the feedback mechanism in case of success (i.e. both the speaker and the hearer
update their inventories). On the other hand, in a mean-field system, the set of words stored in the
inventories should be approximately the same for all agents, which suggests neglecting correlations and
focusing on the behavior of a single node subjected to the average effect of the rest of the system. Since the
hearer's interaction rule is strictly deterministic, while the speaker chooses randomly from its inventory,
it seems more reasonable to take the (mean-field) average over the hearer's term.

We can now write down an equation for the number of undecided speakers $n_u(t)$. First, each node is
chosen as speaker or hearer with the same probability, thus along the temporal evolution all nodes will
play as speakers and the equation can be considered to give a rough but correct approximation of the
real dynamics. Then, the four types of interactions depicted in Fig. 5.3 can be grouped into only two
terms (see Fig. 5.4): one in which the speaker is an undecided node, the other in which the speaker
has only one word in its inventory. The latter case does not contribute to the rate equation for $n_u(t)$,
because in case of both success and failure the number of undecided speakers does not change. The
only significant contribution is the second term in Fig. 5.4, in which $n_u(t)$ is decreased by 1 with a
fixed probability.

The naive rate equation for the number of undecided agents reads

\[
\frac{dn_u(t)}{dt} = - \frac{n_u(t)}{N} \, S(t) , \qquad (5.2)
\]

in which $n_u(t)/N$ is the probability of choosing an undecided agent as speaker, and $S(t)$ is the success
rate, i.e. the average probability of having a successful interaction. The functional form of the success
rate is not theoretically known, but we see from Fig. 5.2 that it varies very slowly after the peak until
the convergence process sets in; then it shows a super-exponential saturation to 1, being enhanced by
the decrease of the number of redundant words. Using numerical results for $S(t)$ and $n_u(t)$ we have
checked that the relation in Eq. 5.2 is in fact approximately satisfied.

According to this picture, Eq. 5.2 sheds light on the type of self-accelerating cascade process leading the
system to the absorbing state. Just before the peak of $N_w(t)$, the success rate increases only linearly
as $t/N^2$, then it slows down before the sudden acceleration leading to convergence; therefore, after
the peak, $S(t) \sim O(1/\sqrt{N})$. This result can also be obtained by recalling that the average probability of
success (i.e. the success rate) is given by the ratio between the average inventory size and the number
of different words present in the system. At the beginning of the convergence process, just after the
peak, $N_d(t)$ is almost constant while the average inventory size is slightly decreasing (starting from
$O(\sqrt{N})$), i.e. $S(t) \sim O(1/\sqrt{N})$. Then both these quantities start to decrease until they reach 1 (at
convergence).

Solving Eq. 5.2 in the temporal range just after $t_{max}$, where $S(t) \sim O(1/\sqrt{N})$ varies very slowly, we
realize that the system has already entered the convergence process, which is approximately exponential
with a characteristic time of order $N^{3/2}$ (justifying the observed scaling). However, this exponential
decrease immediately triggers the growth of the success rate, so that the global convergence becomes
a self-enhancing process that results in a super-exponential approach to the absorbing state.
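
The exponential regime mentioned above can be written out explicitly. In the sketch below, the success rate just after the peak is treated as a constant, $S(t) \simeq s_0/\sqrt{N}$, where $s_0$ is a hypothetical constant of order one introduced here only for illustration; the naive rate equation for $n_u(t)$ then integrates to a simple exponential.

```latex
% Assumption: S(t) \simeq s_0/\sqrt{N} is constant for t slightly larger than t_{max}
\frac{dn_u}{dt} = -\frac{s_0}{N^{3/2}}\, n_u
\quad\Longrightarrow\quad
n_u(t) = n_u(t_{max})\, \exp\!\left[-\frac{s_0\,(t - t_{max})}{N^{3/2}}\right],
\qquad
\tau = \frac{N^{3/2}}{s_0} .
```

The characteristic time $\tau$ is of order $N^{3/2}$, in line with the observed scaling of the convergence time; as soon as $S(t)$ starts growing, the decay becomes faster than this estimate.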

5.3 The Role of the Topology

The mean-field case is, from many points of view, rather unrealistic. The agreement process described
by the Naming Game model is, indeed, an example of social dynamics, which is likely to take place on
more realistic topological structures like those characterizing social networks. We expect the topology
to have an effect on the dynamics, in particular on the time required to reach the absorbing state and
on the process of word propagation. It is thus interesting to study the Naming Game on different
topologies, especially on complex networks, whose structure presents properties that are typically
observed in social environments.

Studies of dynamical processes on complex networks showed that the main properties governing the
overall dynamics are the small-world property and the presence of hubs in heterogeneous networks;
it is thus important to understand the impact of these properties on the dynamics. For this
reason, we start by analyzing, in Section 5.3.1, the behavior of the Naming Game on low-dimensional
lattices, which we prove to be governed by coarsening dynamics; then we show that the presence
of topological shortcuts (i.e. the small-world property) is responsible for a crossover from a slowly
converging process to a faster one (Section 5.3.2). Finally, in Section 5.3.3, we survey the behavior
of the Naming Game on complex networks, giving special attention to the role played
by the hubs in heterogeneous topologies.

The behavior of the Naming Game on different topologies has recently attracted the attention of
researchers in the field of A.I., who are interested in knowing which topology ensures the best trade-off
between a fast convergence to consensus and an optimization of the required memory per agent. For
this reason our work is expected to be relevant for the application of similar models to the description
and/or modeling of learning processes of robots |155l 1151].

5.3.1 Coarsening dynamics on low-dimensional lattices 

When the Naming Game is embedded in a regular $d$-dimensional lattice, the agents interact only with
their $2d$ neighbors, and the overall dynamical properties turn out to be very different from those of the
mean-field case. In particular, the time required by the system to reach global consensus displays
a different scaling with the size $N$, and the effective size of the inventories is considerably reduced.
The existence of different dynamical patterns is clearly visible in Fig. 5.2, where we have
reported the curves for the total number of words $N_w(t)$, the number of different words $N_d(t)$ and the
success rate $S(t)$ in the cases of mean-field topology (circles), one-dimensional lattice (squares) and
two-dimensional lattice (triangles).

At the early stages of the dynamics, we observe a sharp growth of the success rate, meaning that
agents easily find a local agreement with their neighbors; then, the dynamics seems to slow down and
convergence is reached in a much longer time than in the mean-field case. We know that, in the
initial phase, the success rate is equal to the probability that two agents that have already played are
chosen again, and is proportional to $t/E$, where $E$ is the number of possible interacting pairs (i.e. the
edges). In the mean-field case, $E \propto N^2$, while in finite $d$-dimensional systems $E \propto N d$, which explains
the observed slopes of $S(t)$ (Fig. 5.2): this quantity grows $N$ times faster in finite dimensions. At larger
times, the eventual convergence is much slower in finite dimensions than in the mean-field case.
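
The edge-counting argument can be checked with a few lines of arithmetic. The following sketch assumes a hypercubic lattice with periodic boundary conditions (so that every site has exactly $2d$ neighbors) and an initial success rate proportional to $t/E$; the function names are illustrative only.

```python
def complete_graph_edges(n_agents):
    """Number of interacting pairs E on a complete graph (mean-field case)."""
    return n_agents * (n_agents - 1) // 2


def lattice_edges(n_agents, dimension):
    """Number of edges E of a periodic hypercubic lattice (2d neighbors per site)."""
    return n_agents * dimension


def slope_ratio(n_agents, dimension):
    """Initial slope of S(t) on the lattice divided by the mean-field one, assuming S ~ t/E."""
    return complete_graph_edges(n_agents) / lattice_edges(n_agents, dimension)
```

With $N = 1024$ agents, `slope_ratio(1024, 1)` gives 511.5 and `slope_ratio(1024, 2)` gives 255.75, i.e. a factor of order $N$ up to a constant depending on $d$.
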

The curves for $N_w(t)$ and $N_d(t)$ display in all cases a sharp increase at short times, a maximum at
a given time $t_{max}$ and then a decay towards the consensus state in which all agents share the same
unique word, reached at $t_{conv}$. The short-time regime corresponds to the creation of many different
words by the agents. After a time of order $N$, each agent has played typically once, and therefore
$O(N)$ different words have been invented (typically $N/2$). In the mean-field case, each agent can interact
with all the others, so that it can learn many different words; in contrast, in finite dimensions words can
spread only locally, and each agent has access only to a finite number of different words. The total
memory used scales as $N$, and the time $t_{max}$ needed to reach the maximum number of words in the system
scales as $N^{\alpha_d}$, with $\alpha_1 = \alpha_2 = 1$ (Fig. 5.5). No plateau is observed in the total number of distinct
words, since coarsening of clusters of agents soon starts to eliminate words. Furthermore, the system
reaches consensus in a time $t_{conv}$ that grows as $N^{\beta_d}$, with $\beta_1 \simeq 3$ in one dimension and $\beta_2 \simeq 2$ in
two dimensions (while in the mean-field case $\beta_{MF} \simeq 1.5$).

Figure 5.5: Scaling of the time at which the number of words is maximal, and of the time needed to
reach convergence, in 1 and 2 dimensions.

In Fig. 5.6 we have represented a typical evolution of agents on a one-dimensional lattice, by displaying
one below the other a certain number of (linear) configurations corresponding to successive, equally
spaced temporal steps. Each agent with a unique word in memory is represented by a colored point,
while agents with more than one word in memory are shown in black. The figure clearly shows the
growth of clusters of agents with a unique word by diffusion of interfaces made of agents with more
than one word in the inventory. The fact that the interfaces remain small is however not obvious a
priori, and requires a deeper analytical investigation that is presented in Appendix IU1.

The results presented in Appendix O are confirmed by numerical simulations, as illustrated in Fig. 5.7:
we have found that the probability $P(x,t)$ of finding an interface in position $x$ at time $t$ is a Gaussian
around the initial position, while the mean-square distance reached by the interface at time $t$ follows
the diffusion law $\langle x^2 \rangle = 2 D_{exp} t/N$, with experimental diffusion coefficient $D_{exp} \simeq 0.224$.

The dynamical evolution of clusters in the Naming Game on a one-dimensional lattice can be described
as follows: at short times, pairwise interactions create $O(N)$ small clusters, divided by $O(N)$ thin
interfaces (see the first lines in Fig. 5.6). The interfaces then start diffusing. When two interfaces
meet, the cluster situated between them disappears, and the two interfaces coalesce. Such
a coarsening leads to the well-known growth of the typical size $\xi$ of the clusters as $t^{1/2}$. The density
of interfaces, at which unsuccessful interactions can take place, decays as $1/\sqrt{t}$, so that $1 - S(t)$ also
decays in the same way. Moreover, starting from an initial configuration in which agents have no
words, a time $N$ is required to reach an average cluster size of order 1, so that $\xi$ grows as $\sqrt{t/N}$ (as
also shown in Appendix lUl by the fact that the diffusion coefficient is $D/N$). This remark explains
the time $t_{conv} \sim O(N^3)$ needed to reach global agreement, i.e. $\xi = N$.

Figure 5.6: Typical evolution of a one-dimensional system ($N = 1000$). Black color corresponds to
interfaces (sites with more than one word). The other colors identify different single-state clusters. The
vertical axis represents time (from top to bottom, $1000 \times N$ sequential steps); the one-dimensional
snapshots are reported on the horizontal axis.
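
The one-dimensional dynamics described above can be explored with a small simulation. The sketch below is illustrative only: it assumes a ring with periodic boundaries, picks the speaker uniformly at random and the hearer among its two nearest neighbors, and counts as interface sites the agents storing more than one word. The names `ring_naming_game` and `interface_sites` are hypothetical.

```python
import random


def ring_naming_game(n_agents, n_steps, seed=0):
    """Run n_steps pairwise interactions of the minimal Naming Game on a ring."""
    rng = random.Random(seed)
    inventories = [set() for _ in range(n_agents)]
    next_word = 0
    for _ in range(n_steps):
        speaker = rng.randrange(n_agents)
        # The hearer is one of the two nearest neighbors of the speaker.
        hearer = (speaker + rng.choice((-1, 1))) % n_agents
        if not inventories[speaker]:
            # Empty inventory: the speaker invents a new word.
            inventories[speaker].add(next_word)
            next_word += 1
        word = rng.choice(sorted(inventories[speaker]))
        if word in inventories[hearer]:
            # Success: both agents keep only the transmitted word.
            inventories[speaker] = {word}
            inventories[hearer] = {word}
        else:
            # Failure: the hearer learns the word.
            inventories[hearer].add(word)
    return inventories


def interface_sites(inventories):
    """Number of agents with more than one word, i.e. sites belonging to interfaces."""
    return sum(1 for inv in inventories if len(inv) > 1)
```

Recording `interface_sites` at increasing times, and averaging over several seeds, is one way to look at the decay of the interface density qualitatively; reaching consensus requires a number of steps of order $N^3$, so only small rings are practical with this naive implementation.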

This framework can be extended to the case of higher dimensions, starting from the case d = 2. 
Fig. 15.81 shows four different snapshots of the system during the evolution. The interfaces of two- 
dimensional clusters, although quite rough, are well defined and their width does not grow in time, 
which points to the existence of an effective surface tension. We have substantiated this picture with 
two types of measures: following Dornic et al. |101| . one is the measure of the time dependence in 
the implosion of a cluster of a given radius; the other is the numerical computation of the equal-time 
pair correlation function. 

Let us consider the time-dependent behavior of the linear size of a droplet, i.e. a cluster with the
form of a bubble in a sea of sites with a different word. As predicted by the theory of coarsening
phenomena [M], the radius R(t) of the bubble decreases as $\sqrt{R_0^2 - \alpha t}$, where $\alpha$ is a coefficient
related to the surface tension (see Fig. 5.9). The equal-time pair correlation function C(r,t) in dimension
d = 2 (not shown) can be rescaled using the scaling function $C(r,t) \sim f(r/t^{1/2})$, where r is the linear
length, indicating that the characteristic length scale $\xi$ grows as $\sqrt{t/N}$ (a time O(N) is needed to
initialize the agents to at least one word and therefore to reach a cluster size of order 1). This result
is in agreement with coarsening dynamics for non-conserved fields [54]. In terms of the linear length scale
$\xi$, the convergence time $t_{conv}$ corresponds to the time necessary to reach $\xi = N^{1/d}$, thus we expect
$t_{conv} \sim N^{1+2/d}$. This scaling has been verified by numerical simulations in d = 2 and d = 3 (not
shown).
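
The scaling of the convergence time follows in one line from the growth law of the characteristic
length. The short derivation below only restates the argument above; the explicit exponents for
d = 1, ..., 4 are worked out here for convenience and are not additional results.

```latex
\begin{align*}
\xi(t) \sim \sqrt{t/N}, \quad \xi(t_{conv}) \sim N^{1/d}
  &\;\Longrightarrow\; \frac{t_{conv}}{N} \sim N^{2/d}
   \;\Longrightarrow\; t_{conv} \sim N^{1+2/d}, \\
d = 1:\; t_{conv} \sim N^{3}, \qquad
d = 2:\; t_{conv} \sim N^{2}, \qquad
d = 3:\; t_{conv} &\sim N^{5/3}, \qquad
d = 4:\; t_{conv} \sim N^{3/2}.
\end{align*}
```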

Note that the obtained scaling law is the same as that of coarsening dynamics in non-equilibrium Ising
models, but an additional factor N comes from the different timescale used (here, a time step corresponds to a
single interaction instead of N interactions). Because of the presence of surface tension and of an evolution rule
with many possible states, the Naming Game model is similar to Potts-like models, but in the Naming
Game the number of states (words, opinions, etc.) is determined by the dynamics, not fixed a priori.
This feature is decisive for the overall behavior of the system, since the convergence time depends
both on the ordering process and on the initial spreading of words.

[Residue of a figure caption: "... = 2 D_exp t/N with a coefficient D_exp ≈ 0.224, in agreement with the theoretical ..."]

In low-dimensional lattices, the time of the peak in the number of words scales as $t_{max} \sim O(N)$,
thus the overall dynamics is dominated by its second part, which is considerably slower. Increasing the
dimension, the peak height and time increase, approximately as $N\sqrt{d}$, while the coarsening time
decreases as $N^{1+2/d}$. For d = 4, the global consensus is reached in a time that scales as in the
mean-field case, but the mean-field behavior is dominated by the scaling of the time of the peak ($O(N^{3/2})$),
which is smaller (O(N)) in all finite dimensions. Hence, the general non-equilibrium dynamics of the
Naming Game model in finite-dimensional lattices is the result of the interplay between two different
dynamical regimes (i.e. of creation and elimination of words).

5.3.2 Crossover to a fast-converging process in small-world networks

The precise knowledge of the dynamical behavior of the Naming Game model on low-dimensional
lattices, and in particular on the one-dimensional ring, makes it possible to understand, by means of
simple arguments and numerical simulations, the effect of the small-world property, which is a relevant
feature of real complex networks.

In the following we investigate the effect of introducing long-range connections that link
agents that are far from each other on the regular lattice. In other words, we study the Naming
Game on the small-world model proposed by Watts and Strogatz [247]. The detailed description of
the model is reported in Section 2.4.1; however, we recall that, starting from a quasi-one-dimensional
banded network in which each node has 2m neighbors, the edges are rewired with probability p, i.e.
p represents the density of long-range connections introduced in the network. For p = 0 the network
Figure 5.8: Four snapshots at different times (lexicographic order from the top-left corner) for a
two-dimensional system. The form of the clusters (composed of agents with a unique word) clearly shows
the presence of an effective surface tension, associated with coarsening dynamics. The width of the
interfaces (in black) is very small, as already seen for the one-dimensional model.

retains an essentially one-dimensional topology, while the random network structure is approached
as p goes to 1. At small but finite p ($1/N \ll p \ll 1$), a small-world structure with short distances
between nodes, together with a large clustering, is obtained.
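
The rewiring construction recalled above can be sketched in a few lines. This is a minimal
illustration, not the code used for the simulations reported here: the function name
`watts_strogatz_ring`, the adjacency-set representation and the choice of rewiring only one end of
each lattice edge are assumptions made for the example.

```python
import random


def watts_strogatz_ring(n, m, p, seed=None):
    """Banded ring of n nodes with 2m neighbors each, rewired with probability p.

    Returns a dict mapping each node to the set of its neighbors.
    """
    if n <= 2 * m:
        raise ValueError("need n > 2*m for a simple banded ring")
    rng = random.Random(seed)

    # Banded one-dimensional lattice: node i is linked to i+1, ..., i+m (mod n).
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, m + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)

    # Rewire each lattice edge (i, i+d) with probability p: end i is kept and
    # the other end is moved to a node chosen uniformly among the nodes that
    # are neither i itself nor already neighbors of i (an assumed convention).
    for d in range(1, m + 1):
        for i in range(n):
            if rng.random() >= p:
                continue
            j = (i + d) % n
            candidates = [v for v in range(n) if v != i and v not in adj[i]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[i].remove(j)
            adj[j].remove(i)
            adj[i].add(new)
            adj[new].add(i)
    return adj
```

With p = 0 the function returns the regular banded ring; the number of edges, N m, is conserved for
any p, so the average degree stays equal to 2m.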

When p = 0, the system is one-dimensional and the dynamics proceeds by slow coarsening. At small
p, the typical distance between shortcuts is O(1/p), so that the early dynamics is not affected and
proceeds as in one-dimensional systems. In particular, at very short times many new words are invented
since the success rate is small. The maximum number of different words scales as O(N), as in the
other cases, while the average used memory per agent remains finite, since the number of neighbors
of each site is bounded (the degree distribution decreases exponentially [2H], see Section 2.4.1).
The typical cluster dynamics on a small-world network is graphically represented in Fig. 5.10 (see
footnote 1). As long as the typical cluster size is smaller than 1/p, the clusters are typically
one-dimensional, and the system evolves by means of the usual coarsening dynamics. However, as the
average cluster size reaches the typical distance between two shortcuts, of order 1/p, a crossover
phenomenon toward an accelerated dynamics takes place. Since the cluster size grows as $\sqrt{t/N}$, this
corresponds to a crossover time $t_{cross} = O(N/p^2)$. For times much larger than this crossover, one
expects the dynamics to be dominated by the existence of shortcuts, entering a mean-field-like regime.
The convergence time is thus expected to scale as $N^{3/2}$ and not as $N^3$. The condition for this
picture to hold is exactly the small-world condition; indeed, the crossover time $N/p^2$ has to be much
larger than 1 and much smaller than the consensus time for the one-dimensional case, $N^3$, which
together imply $p \gg 1/N$.
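
The last step can be made explicit. The lines below only unpack the two inequalities stated above,
using the fact that a rewiring probability satisfies p ≤ 1.

```latex
\begin{align*}
\frac{N}{p^{2}} \ll N^{3}
  &\;\Longleftrightarrow\; p^{2} \gg \frac{1}{N^{2}}
   \;\Longleftrightarrow\; p \gg \frac{1}{N}, \\
\frac{N}{p^{2}} \gg 1
  &\;\Longleftrightarrow\; p \ll \sqrt{N}
  \quad \text{(always satisfied, since } p \le 1 \text{)}.
\end{align*}
```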

1 Following the analysis of Section 5.3.1, we call "cluster" a set of neighboring nodes (agents) with the same unique
word.
Figure 5.9: A measure of the surface tension in the two-dimensional Naming Game. The initial
condition is a configuration ($N = L^2 = 400^2$) with only two clusters, an internal droplet (a bubble)
of radius $R_0 = 100$ and a surrounding sea with a different word. Initially all agents already possess
one word, thus no new words are created, but the presence of surface tension at the interface between
the two clusters provokes the decrease of the bubble's size. We have monitored the normalized area
$\pi R^2(t)/L^2$ as a function of time. According to coarsening dynamics, the radius R(t) decreases
as $\sqrt{t}$. Note that the linear trend is very clear even if the data refer to a single realization.

In summary, the small-world topology combines advantages of both finite-dimensional lattices and
mean-field networks: on the one hand, only a finite memory per node is needed, in contrast to
the $O(N^{1/2})$ needed in mean-field; on the other hand, the convergence time is expected to be much shorter
than in finite dimensions. The theoretical predictions have been verified by monitoring the behavior of
the usual global quantities. Figure 5.11 displays the evolution of the average number of words per
agent as a function of time, for a small-world network with average degree $\langle k \rangle = 8$ and various values
of the rewiring probability p. While $N_w(t)$ in all cases decays to N after an initial peak whose height
is proportional to N, the way in which this convergence is obtained depends on the parameters. At
fixed N, for p = 0 a power-law behavior $N_w/N - 1 \propto t^{-1/2}$ is observed, due to the one-dimensional
coarsening process. As soon as p > 1/N, however, we observe deviations that get stronger as p is
increased: the decrease of $N_w$ is first slowed down after the peak, but leads in the end to a very fast
convergence. The effect is more evident for larger p. Moreover, increasing the size of the system the
convergence gets slower.

As previously mentioned, a crossover phenomenon is expected when the one-dimensional clusters
reach sizes of order 1/p, i.e. at a time of order $N/p^2$. Since the agents with more than one word
in memory are localized at the interfaces between clusters, their number is O(Np). The average
excess memory per site (with respect to global consensus) is thus of order p, so that one expects
$N_w/N - 1 = p\,g(tp^2/N)$. Figure 5.12-A indeed shows that the data of $(N_w/N - 1)/p$ for various values
of p and N collapse when $tp^2/N$ is of order 1. On the other hand, Fig. 5.12-B indicates that the
convergence towards consensus is reached on a timescale of order $N^{\beta_{SW}}$, with $\beta_{SW} \simeq 1.4 \pm 0.1$, close
to the mean-field case $N^{3/2}$ and in strong contrast with the $N^3$ behavior of purely one-dimensional
systems. Note that the time to converge scales as $p^{-1.4 \pm 0.1}$, which is consistent with the fact that for
Figure 5.10: A naive representation of cluster growth in the small-world model of Watts and Strogatz.
A cluster (in red) starts to expand locally by coarsening dynamics, as in dimension one. When the
size of the cluster is of the order of the average distance between shortcuts, long-range interactions
take place. The effect of these long-range interactions is to speed up the dynamics.



p of order 1/N one should recover an essentially one-dimensional behavior with convergence times of
order $N^3$.

In the small-world regime, the system develops a plateau in the total number of words (after the peak
and before the convergence), whose duration increases with the size N. However, during this plateau
the system evolves continuously towards consensus by elimination of redundant words, as evidenced
by the continuous decrease in the number of distinct words displayed in Fig. 5.13-A, which shows that
curves for various system sizes and values of p collapse when correctly rescaled around the crossover
time $N/p^2$.

The combination of the results concerning average used memory and number of distinct words
corresponds to a picture in which clusters of agents sharing a common unique word compete during the
time lapse between the peak and the final consensus. It is thus interesting to measure how the average
cluster size evolves with time and how it depends on the rewiring probability p. Figure 5.13-B
compares the evolution of the cluster size $\langle s \rangle$ for the one-dimensional case and for finite p. At p = 0, a pure
coarsening law $\langle s \rangle \propto \sqrt{t}$ is observed. As p increases, deviations are observed when time reaches the
crossover $N/p^2$, at a cluster size 1/p, as was expected from the intuitive picture previously developed.
As expected, the collapse of the curves of $\langle s \rangle p$ vs. $tp^2/N$ takes place for $tp^2/N$ of order 1.

Another interesting remark concerns the slowing down of the curves after the peak and before the
convergence. This is possibly related to the first interactions between clusters and shortcuts. When
a cluster touches a shortcut, the presence of a branching point slows down the interface movement.
Thus, the clusters are locally more stable, due to the presence of an effective 'pinning' of interfaces
near a shortcut. This effect is reminiscent of what happens for the Ising model on small-world
networks [52] where, at low temperature, the local field transmitted by the shortcuts delays the passage
of interfaces. Unlike in the zero-temperature limit of the Ising model, however, the present dynamics only slows down
and is never blocked in disordered configurations. The idea of interfaces pinning at nodes that act
as branching points is discussed in more detail for tree structures in the next section.
Figure 5.11: Average number of words per agent in the system, $N_w/N$, as a function of the rescaled
time t/N, for small-world networks with $\langle k \rangle = 8$ and $N = 10^3$ nodes, for various values of p. The
curve for p = 0 is shown for reference, as well as $p = 5 \times 10^{-3}$, $p = 10^{-2}$, $p = 2 \times 10^{-2}$, $p = 4 \times 10^{-2}$,
$p = 8 \times 10^{-2}$, from bottom to top on the left part of the curves.

5.3.3 Naming Game on complex networks 

In this section we present the main results on the dynamics of the minimal Naming Game model on
complex networks. Before entering into the details of the analysis, it is worth noting that the minimal
Naming Game model itself, as described in Section 5.2.1, is not well-defined on general networks. On
regular topologies, the number of neighbors of a node is fixed, thus all methods used to choose
at random a pair of neighboring agents to interact are completely equivalent. In a general network,
on the contrary, different nodes possess a variable number of neighbors, therefore it is important to
specify the strategy used to draw the pairs of interacting agents, since it can produce a
different dynamical behavior. When choosing a pair, one should specify which player is chosen first,
the speaker or the hearer.

On a generic network with degree distribution P(k), the degree of the first chosen node and of its
chosen neighbor are distributed respectively according to P(k) and to $kP(k)/\langle k \rangle$. The second node will
therefore typically have a larger degree, and the asymmetry between speaker and hearer can couple to
the asymmetry between a randomly chosen node and its randomly chosen neighbor, leading to different
dynamical properties. This aspect of dynamical processes evolving on irregular topologies and
networks was first noticed by Suchecki et al. [284] and Castellano et al. [68] [EH] in the case of
the Voter model. This is particularly relevant in heterogeneous networks, in which the neighbor of a
randomly chosen node is likely to be a hub.
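
The bias towards large degrees can be checked numerically on any degree sequence. The sketch below
is only an illustration: the function name `degree_distributions` is hypothetical, and the expression
$kP(k)/\langle k \rangle$ is the degree distribution of a node reached by following a randomly chosen edge, which
coincides with that of a random neighbor of a random node only when degree correlations are neglected.

```python
from collections import Counter


def degree_distributions(degrees):
    """Return (P(k), k P(k) / <k>) as two dicts keyed by the degree k.

    degrees is the list of node degrees (all assumed positive).
    """
    n = len(degrees)
    mean_k = sum(degrees) / n
    p_node = {k: c / n for k, c in Counter(degrees).items()}
    # Following a random edge reaches a node of degree k with a probability
    # proportional to k times the fraction of nodes having that degree.
    p_neighbor = {k: k * pk / mean_k for k, pk in p_node.items()}
    return p_node, p_neighbor


def mean_degree(dist):
    """Average degree under a distribution given as a dict k -> probability."""
    return sum(k * pk for k, pk in dist.items())
```

For a heterogeneous sequence, `mean_degree(p_neighbor)` equals $\langle k^2 \rangle / \langle k \rangle$, which is larger than
$\langle k \rangle$ as soon as the degrees are not all equal.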

This remark on the asymmetry of dynamical processes on networks suggests defining three different
pair selection strategies for the Naming Game:

• A randomly chosen speaker randomly selects a hearer among its neighbors. This is probably the
most natural generalization of the original rule. We call this strategy direct Naming Game. In
this case, larger degree nodes will preferentially act as hearers.

• The opposite strategy, called reverse Naming Game, can also be carried out: we choose the
hearer at random and one of its neighbors as speaker. In this case the hubs are preferentially
selected as speakers.
Figure 5.12: (A) Rescaled curves of the average number of words per agent in the system, in order
to show the collapse around the crossover time $N/p^2$. For each value of p, two values of the system
size ($N = 10^4$ and $N = 10^5$) are displayed. The curves for different sizes are perfectly superimposed
before the convergence. (B) Convergence at large times, shown by the drop of $N_w/N - 1$ to 0: the
time is rescaled by $N^{1.4}$. For each p, three different sizes ($N_1 = 10^3$ for the left peak, curves in black,
$N_2 = 10^4$ for the middle peak, curves in blue, and $N_3 = 10^5$, right peak, curves in red) are shown. On
the $N^{1.4}$ scale, the convergence becomes more and more abrupt as N increases. The inset displays the
convergence time as a function of size for p = 0 (bullets), p = 0.01 (squares), p = 0.02 (diamonds),
p = 0.04 (triangles), p = 0.08 (crosses); the dashed lines are proportional to $N^3$ and $N^{1.4}$.

• A neutral strategy to pick up pairs of nodes is that of considering the extremities of an edge 
taken uniformly at random. The roles of speaker and hearer are then assigned randomly with 
equal probability among the two nodes. 
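
The three strategies listed above differ only in the order in which the two roles are drawn, as the
following sketch makes explicit. It is a hedged illustration: the function name `select_pair`, the
strategy labels and the adjacency-list representation are assumptions of this example, and the
network is assumed to have no isolated nodes.

```python
import random


def select_pair(adj, strategy, rng=random):
    """Return (speaker, hearer) for one interaction.

    adj maps each node to a non-empty list of its neighbors.
    """
    nodes = list(adj)
    if strategy == "direct":
        # Random speaker, then a random neighbor as hearer.
        speaker = rng.choice(nodes)
        hearer = rng.choice(adj[speaker])
    elif strategy == "reverse":
        # Random hearer, then a random neighbor as speaker.
        hearer = rng.choice(nodes)
        speaker = rng.choice(adj[hearer])
    elif strategy == "neutral":
        # Each undirected edge appears twice, once per orientation, so a
        # uniform draw picks a uniform edge and assigns the roles at random.
        oriented = [(u, v) for u in nodes for v in adj[u]]
        speaker, hearer = rng.choice(oriented)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return speaker, hearer
```

On a star graph, for instance, the hub is the hearer in most direct interactions and the speaker in
most reverse ones, while the neutral rule gives it each role half of the time. The lists built
inside the function are rebuilt at every call for clarity; a simulation would precompute them.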

Figure 5.14 compares the evolution of the direct and the reverse Naming Game on a heterogeneous
network, a Barabási-Albert (BA) network with $N = 10^4$ agents and $\langle k \rangle = 4$. In the case of the
reverse rule, a larger memory is used although the number of different words created is smaller, and
a faster convergence is obtained. This corresponds to the fact that the hubs, playing mainly as
speakers, can spread their words to a larger fraction of the agents, and remain more stable than when
playing as hearers, enhancing the possibility of convergence. Similarly to the case of the Voter model
[68], the scaling laws of the convergence time for direct and reverse strategies seem to be the same
only in some very special cases (power-law degree distribution with exponent $\gamma = 3$); however, we do
not have an accurate study of the properties of the reverse NG. This is because, from the
point of view of a realistic interaction among individuals or computer-based agents, the direct Naming
Game, in which the speaker chooses a hearer among its neighbors, seems more natural than
the other ones. For this reason we have focused on the direct Naming Game. In future work we will
study in more detail the similarities and differences of the three strategies.

Global quantities - As already done for the other topologies, we study the global behavior
of the system by looking at the temporal evolution of three main quantities: the total number $N_w(t)$ of
words in the system, the number of different words $N_d(t)$, and the rate of success S(t). In Fig. 5.15
we report the curves of the number of words ($N_w(t)$ and $N_d(t)$) for homogeneous ER networks (left)
and heterogeneous BA networks (right) with $N = 10^3, 10^4, 5 \times 10^4$ nodes and average degree $\langle k \rangle = 4$.
The corresponding data for the mean-field case (with $N = 10^3$) are displayed as well for reference.
Figure 5.13: (A) Number of different words in the system as a function of time for $\langle k \rangle = 8$,
$p = 10^{-2}, 2 \times 10^{-2}, 4 \times 10^{-2}, 8 \times 10^{-2}$ and increasing sizes. Data have been rescaled in order to collapse the
curves around the crossover time $N/p^2$. Two values of the system size ($N = 10^4$ and $N = 10^5$) are
displayed for each p. (B) Evolution of the cluster size for $N = 10^4$ and various values of p. The curves
are rescaled around the crossover region.



The curves for the average use of memory $N_w(t)$ show a rapid growth at short times, a peak and then
a plateau whose length increases as the size of the system is increased (even when time is rescaled by
the system size, as in Fig. 5.15). The time and the height of the peak, and the height of the plateau,
are proportional to N. A systematic study of the scaling behavior with the size of the system for these
quantities is reported in Fig. 5.16, which shows that the convergence time $t_{conv}$ scales as $N^{\beta}$ with
$\beta \simeq 1.4$ for both ER and BA. The apparent plateau of $N_w$ does not, however, correspond to a steady
state, as revealed by the continuous decrease of the number of different words $N_d$ in the system: in
this re-organization phase, the system keeps evolving by elimination of words, although the total used
memory does not change significantly.

Note that the observed scaling law for the convergence time is a general robust feature that is not
affected by other topological details (average degree, clustering, etc.), and more surprisingly it seems
to be independent of the particular form of the degree distribution. We have indeed checked the value
of the exponent $\beta \simeq 1.4 \pm 0.1$ for various $\langle k \rangle$, clustering, and exponents $\gamma$ of the degree distribution
$P(k) \sim k^{-\gamma}$ for scale-free networks constructed with the uncorrelated configuration model. These
parameters have instead an effect on other quantities such as the time and the value of the maximum
of memory (that will be analyzed later).

The ubiquity of the scaling exponent $\beta \simeq 1.4$ could be related to the fact that all these networks
present the small-world property. In many equilibrium and non-equilibrium statistical physics models
defined on general networks, the small-world property is sufficient to ensure a mean-field-like behavior
of the system. In the present case, the discrepancy with the mean-field exponent ($\beta_{MF} = 1.5$) may
be due to logarithmic corrections that are unlikely to be captured using numerical scaling techniques.

Comparing Figures 5.15 and 5.16 with those for the mean-field (MF) topology and the regular lattices
reported in the preceding sections (see Section 5.3.1 for the lattices), some important analogies and
differences emerge. Thanks to the finite average connectivity, the memory peak scales only linearly
with the system size N, and is reached after a time O(N), in contrast with MF ($O(N^{1.5})$ for peak
height and time) but similarly to the finite-dimensional case. The MF plateau observed in the number
of different words is replaced here by
a slow continuous decrease of $N_d$ with an almost constant memory used. With respect to the slow
Figure 5.14: Total memory $N_w$ (top) and number of different words $N_d$ (bottom) vs. rescaled time
for two different strategies of pair selection on a BA network of $N = 10^4$ agents, with $\langle k \rangle = 4$. The
reverse NG rule (black full line) converges much faster than the direct rule (red dashed line). Note
that the two strategies do not lead to the same scaling laws with the system size for the convergence
time (not shown).

coarsening process observed in finite-dimensional lattices, on the other hand, the existence of short
paths among the nodes speeds up the convergence towards the global consensus. Therefore, complex
networks exhibiting small-world properties constitute an interesting trade-off between mean-field
"temporal efficiency" and regular lattice "storage optimization".

The success rate S(t) is displayed in Figure 5.17-A for ER (top) and BA (bottom) networks with
$N = 10^3$ (red full line) and $10^4$ (blue dashed line) agents and $\langle k \rangle = 4$. The success rate for the
mean-field ($N = 10^3$) is also reported (black dotted lines). In both networks the success rate increases
linearly at very short times (see also Fig. 5.17-B); then, after a plateau similar to the one observed for
$N_w$, it increases on a fast timescale towards 1. At short times most inventories are empty, so that the
success rate is equal to the probability that two agents interact twice, i.e. t/E, where $E = N\langle k \rangle/2$
is the number of possible interacting pairs. The curves for BA networks in Fig. 5.17-B give slopes
in agreement with the theoretical prediction $2/(\langle k \rangle N)$. Compared to the mean-field case, in which
$E = O(N^2)$, the initial success rate grows faster. When $t \sim O(N)$, no inventory is empty anymore,
words start spreading through unsuccessful interactions and S(t) displays a bending.

Clusters statistics - Without entering the detailed analysis of the behavior of clusters of
words, for which we refer to Ref. [!JT], it is worth spending a few words on this aspect of the Naming
Game dynamics. We have called "cluster" any set of neighboring agents sharing a common unique
word. In Section 5.3.1 we have shown that, in low-dimensional lattices, the dynamics of the Naming
Game proceeds by formation of such clusters, which grow through a coarsening phenomenon: the average
cluster size (resp. the number of clusters) increases (resp. decreases) algebraically with time. As
shown instead in Fig. 5.18, for both models (ER and BA) the normalized average cluster size remains
very close to zero (in fact, of order 1/N) during the re-organization phase that follows the peak in the
number of words, and converges to one with a sudden transition. The same behavior is shown also by
the number of clusters $N_{cl}(t)$, which decreases to one very sharply.
The emerging picture is not that of a coarsening or growth of clusters, but that of a slow process of
Figure 5.15: ER random graph (left) and BA scale-free network (right) with $\langle k \rangle = 4$ and sizes
$N = 10^3, 10^4, 5 \times 10^4$. Top: evolution of the average memory per agent $N_w/N$ versus rescaled time t/N.
For increasing sizes a plateau develops in the re-organization phase preceding the convergence. The
height of the peak and of the plateau collapse in this plot, showing that the total memory used scales
with N. Bottom: evolution of the number of different words in the system. $(N_d - 1)/N$ is plotted
in order to emphasize the convergence to the consensus with $N_d = 1$. A steady decrease is observed
even if the memory $N_w$ displays a plateau. The mean-field (MF) case is also shown (for $N = 10^3$) for
comparison.



correlations between inventories, followed by a multiplicative process of cluster growth triggered by 
a sort of symmetry breaking event in the success probability of the words (in favor of the word that 
will ultimately survive). 

Effect of the degree heterogeneity - Global properties of dynamical processes are often 
affected by the heterogeneous character of the network topology [2031 ll()2j . We have shown however 
that the global dynamics of the Naming Game is similar on heterogeneous and homogeneous networks. 
Nonetheless, a more detailed analysis reveals that agents with different degrees present very different 
activity patterns, whose characterization is necessary to get additional insights on the Naming Game 
dynamics |3T1I32] . 

Let us first consider the average success rate $S_k(t)$ of nodes of degree k; at the early stages of the
dynamics it can be computed using simple arguments. The probability of choosing twice the edge
(i,j) is

$$ \frac{t}{N}\left(\frac{1}{k_i} + \frac{1}{k_j}\right), \qquad (5.3) $$

i.e. the probability of choosing first i (1/N) then j ($1/k_i$) or vice versa. Neglecting the correlations
between $k_i$ and $k_j$, one can average over all nodes i of fixed $k_i = k$, obtaining

$$ S_k(t) \simeq \frac{t}{N}\left(\frac{1}{k} + \left\langle \frac{1}{k} \right\rangle\right). \qquad (5.4) $$

Eq. 5.4 shows that, at the very beginning, the success rate grows linearly but the effect of the degree
heterogeneity is partially screened by the presence of the constant term $\langle 1/k \rangle$. The same argument
Figure 5.16: (Top) Scaling behavior with the system size N for the time of the memory peak ($t_{max}$)
and the convergence time ($t_{conv}$) for ER random graphs (left) and BA scale-free networks (right) with
average degree $\langle k \rangle = 4$. In both cases, the maximal memory is needed after a time proportional to
the system size, while the time needed for convergence grows as $N^{\beta}$ with $\beta \simeq 1.4$. (Bottom) In both
networks the necessary memory capacity (i.e. the maximal value reached by $N_w$) scales linearly with
the size of the network.



can be used to predict that the success rate should be essentially degree independent for larger times. 
S(t) is indeed always given by two terms, of which only that referring to the node playing as speaker 
contains an explicit dependence on 1/k. 
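
The early-time estimate of Eq. 5.4 is easy to evaluate on a given degree sequence. The sketch
below is only meant to illustrate the formula: the function name `early_success_rate_by_degree` is
hypothetical, and the expression is meaningful only at short times, when most inventories are still
empty and the resulting rates are much smaller than 1.

```python
def early_success_rate_by_degree(degrees, t):
    """Evaluate S_k(t) ~ (t/N) (1/k + <1/k>) for each degree in the sequence.

    degrees is the list of node degrees (all assumed positive); t is the
    number of elementary interactions performed so far.
    """
    n = len(degrees)
    # Network average of the inverse degree, the degree-independent term.
    mean_inv_k = sum(1.0 / k for k in degrees) / n
    return {k: (t / n) * (1.0 / k + mean_inv_k) for k in sorted(set(degrees))}
```

For a network in which all nodes have the same degree k, the expression reduces to 2t/(kN), i.e.
to t/E with E = Nk/2 edges. In a heterogeneous network, the difference between two degree classes
comes only from the 1/k term, while the average of 1/k is common to all of them.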

Another interesting point concerns the memory peak. In Fig. 5.19 we have computed the height of
the memory peak reached by different classes of nodes, depending on the degree, and we have found
that it is larger for nodes of larger degree. More precisely, the maximal memory used by a node
of degree k is proportional to $\sqrt{k}$ (see bottom panel in Fig. 5.19). This is in agreement with what
was already observed for the mean-field case, in which all agents have degree k = N - 1 and the maximal
value of the total memory $N_w$ scales indeed as $N\sqrt{k} \sim N^{3/2}$. Note however that in the general case,
the estimation of the peak of $N_w$ is not as straightforward. This peak is indeed a convolution of the
peaks of the inventory sizes of single agents, which have distinct activity patterns and may reach their
maximum in memory at different temporal steps.

The knowledge of the average maximal memory of a node of degree k is not sufficient to understand
which degree classes play a major role in driving the dynamics towards the consensus. More insights
on this issue can be obtained by observing the behavior of the total number of different words in each
degree class. A detailed analysis is reported in Ref. [Hj], in which we show that low degree classes
have a larger overall number of different words. This is due to the fact that during the initial phase,
in which words are invented, low degree nodes are more often chosen as speakers and invent many
different words. The hubs individually need a larger memory, but as classes they retain a smaller
number of different words. Then, words are progressively eliminated among low-k nodes while the
hubs, which act as intermediaries and are in contact with many agents, still typically have many words
in their inventories. In this sense, the "super-spreader" role of the hubs allows a faster diffusion of
words throughout the network, and their property of connecting agents with originally different words
helps the system to converge. The very final phase consists in the late adoption of the consensus by
the lowest degree nodes, in a sort of final cascade from the large to the small degrees.
Figure 5.17: (Left) Temporal evolution of the success rate for ER random graphs (red continuous
line) and BA scale-free networks (blue dashed line) with $\langle k \rangle = 4$ and sizes $N = 10^3$ and $10^4$. The
dotted black line refers to the mean-field case ($N = 10^3$). (Right) BA network, $N = 10^3$. The short
time behavior of the success rate S(t) is shown for $\langle k \rangle = 4$ (circles), $\langle k \rangle = 8$ (squares), and $\langle k \rangle = 16$
(triangles). The curves are linear, with a slope that is in agreement with the predicted value $2/(\langle k \rangle N)$.



Effect of the average degree and clustering - Social networks are generally sparse graphs,
but their structure is often characterized by high local cohesiveness, which is the result of a very natural
transitive property of social interactions [1H7]. The simplest way to take these features into account
in the dynamics of the Naming Game is to study the effects of changing the average degree and
the clustering coefficient of the network. Fig. 5.20 displays the effects of increasing the average degree
on the behavior of the main global quantities. In both ER (left) and BA (right) models, increasing
the average degree provokes an increase in the memory used, while the global convergence time is
decreased. More precisely, the dependence of the height $N_w^{max}$ and the time $t_{max}$ of the memory
peak on $\langle k \rangle$ at fixed population size is approximately a power law, with sub-linear behavior
[91]. This remark suggests that the linear scalings of the memory peak properties ($N_w^{max} \propto N^{\alpha}$ and
$t_{max} \propto N^{\alpha}$ with $\alpha = 1$) are altered by an increase in the average degree (not shown), as expected
from the fact that increasing $\langle k \rangle$ brings the system closer to the mean-field behavior, where the scaling
of these quantities is non-linear ($\alpha_{MF} = 1.5$). It is remarkable that the behavior of the convergence
time with N (i.e. a power law $N^{\beta}$ with $\beta \simeq 1.4$) is instead very robust. This is possibly due to the
fact that, in contrast with the power-law dependence of the peak, the convergence time depends only
logarithmically on the average degree.

Note also that the average memory used by a node of fixed degree is larger for larger average degree 
(not shown), therefore such a global argument can also be extended to a degree based analysis. The 
curves of the success rate (not shown) are consistent with the previous analysis. 

Figure 5.18: Number of clusters $N_{cl}$ and normalized average cluster size $\langle s \rangle / N$ vs. time for ER
networks (right) and BA networks (left) with $N = 10^4$, $\langle k \rangle = 4$ (circles), $\langle k \rangle = 8$ (squares), $\langle k \rangle = 16$
(crosses).

We now turn to the effects of varying the clustering coefficient. The clustering changes
slightly when the average degree is changed, but this variation is small enough
for the two effects to be studied separately. Here we use other mechanisms to enhance
clustering, summarized in the following two models with tunable clustering: the clustered Erdős-Rényi
(CER) random graphs, and the mixed BA-DMS model. These networks have been compared to ER and
BA networks with the same size and average degree. The mixed BA-DMS model is obtained as a
generalization of the preferential attachment procedure, in the spirit of the Holme-Kim model [143]:
starting from m connected nodes (with m even), a new node is added at each time step; with
probability q it is connected to m nodes chosen with the preferential attachment rule (BA step), and
with probability 1 - q it is connected to the extremities of m/2 edges chosen at random (DMS-like
step). Only the clustering spectrum differs from those of the BA and DMS models; it can be computed as
$c(k) = 2(1 - q)(k - m)/[k(k - 1)] + O(1/N)$. Changing m and q allows one to tune the value of the
clustering coefficient.
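
The growth rule of the mixed BA-DMS model can be sketched as follows. This is an illustrative
implementation under stated assumptions, not the code used for the results of this chapter: the
function name `mixed_ba_dms` is hypothetical, the m initial nodes are assumed to form a complete
graph, and in the DMS-like step the m/2 edges are assumed to have no node in common, so that every
new node has degree exactly m.

```python
import random


def mixed_ba_dms(n, m, q, seed=None):
    """Grow a mixed BA-DMS network; return its edges as a list of pairs."""
    if m < 2 or m % 2 != 0:
        raise ValueError("m must be an even integer >= 2")
    if n < m:
        raise ValueError("n must be at least m")
    rng = random.Random(seed)

    # Assumption: the m initial nodes are all connected to each other.
    edges = [(i, j) for i in range(m) for j in range(i + 1, m)]

    for new in range(m, n):
        targets = set()
        if rng.random() < q:
            # BA step: a random end of a random edge is a node chosen with
            # probability proportional to its degree; repeat until m
            # distinct targets are found.
            while len(targets) < m:
                targets.add(rng.choice(rng.choice(edges)))
        else:
            # DMS-like step: pick m/2 random edges sharing no node and
            # link the new node to both ends of each (closing triangles).
            while len(targets) < m:
                u, v = rng.choice(edges)
                if u not in targets and v not in targets:
                    targets.update((u, v))
        edges.extend((t, new) for t in sorted(targets))
    return edges
```

With q = 1 the rule reduces to plain preferential attachment, while with q = 0 every new node
closes m/2 triangles, which is what raises the clustering at fixed average degree.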

Since the ER model also displays low clustering, we also consider a purposely modified version
of this random graph model (Clustered ER, or CER model) with tunable clustering. Given $N$ nodes,
each pair of nodes is considered with probability $p$; the two nodes are then linked with probability
$1 - Q$ while, with probability $Q$, a third node (which is not already linked with either) is chosen
and a triangle is formed. The clustering is thus proportional to $Q$ (with $p \sim O(1/N)$ we can
neglect the original clustering of the ER network), while the average degree is approximately given by
$\langle k \rangle \simeq [3Q + (1 - Q)]pN = (2Q + 1)pN$. In order to compare an ER and a CER network
with the same $\langle k \rangle$, we therefore tune $p$ in the construction of the corresponding CER.
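
A minimal Python sketch of the CER construction is given here for illustration. The names
`clustered_er`, `p_for_mean_degree` and `count_triangles` are hypothetical, and two details are
assumptions not fixed by the description above: the third node is drawn uniformly at random with a
bounded number of retries, and a pair that cannot find a valid third node is simply linked.

```python
import random
from itertools import combinations


def p_for_mean_degree(k_mean, n, Q):
    """Pair probability p giving <k> ~ (2Q + 1) p N."""
    return k_mean / ((2.0 * Q + 1.0) * n)


def clustered_er(n, p, Q, seed=None):
    """Clustered ER graph; returns a list of neighbor sets."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() >= p:
            continue                      # pair not considered
        adj[i].add(j)
        adj[j].add(i)
        if rng.random() < Q:
            # Try to close a triangle through a node linked to neither i nor j
            # (assumption: at most 100 uniform draws, then give up).
            for _ in range(100):
                k = rng.randrange(n)
                if k != i and k != j and k not in adj[i] and k not in adj[j]:
                    adj[k].update((i, j))
                    adj[i].add(k)
                    adj[j].add(k)
                    break
    return adj


def count_triangles(adj):
    """Number of triangles: common neighbors summed over edges, divided by 3."""
    per_edge = sum(len(adj[i] & adj[j])
                   for i in range(len(adj)) for j in adj[i] if i < j)
    return per_edge // 3
```

Using `p_for_mean_degree` with $Q = 0$ reproduces a plain ER graph, so the same two functions can
generate an ER and a CER network of equal average degree for comparison.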
Figure 5.21 shows the effect of increasing the clustering at fixed average degree and degree
distribution: the number of different words is not changed, but the average memory used is smaller
and the convergence takes more time. Moreover, the memory peak at fixed $k$ is smaller for larger
clustering (not shown): a node is more likely to speak to two neighbors that share common words
because they are themselves connected and have already interacted, so that it is less likely to learn
new words. At fixed average degree, i.e. fixed global number of links, fewer connections are available
to transmit words from one side of the network to the other, since many links are used in "local"
triangles. Local cohesiveness is therefore, in the long run, an obstacle to global convergence. This
effect is similar to the increase of the percolation threshold observed in clustered networks, due to
the fact that many links are "wasted" in redundant local connections [223].

Effect of hierarchical structures - In the previous sections we have argued that networks
with the small-world property have fast (mean-field like) convergence after a re-organization phase
whose duration depends on other properties of the system. The small-world property holds when the
diameter of the network grows slowly, i.e. logarithmically or slower, with the size $N$. However,
another requirement is necessary: the small world must be generated by shortcuts connecting regions
of the network that are otherwise far away from each other. From this point of view, shortcuts
correspond to the existence of long loops (see Fig. 5.22-A). When shortcuts, and the corresponding
long loops, are absent, the topological structure of the network possesses an intrinsic metric
ordering. In such a situation, regular structures like $d$-dimensional lattices admit a real geometric
distance, whereas disordered topologies are more generally associated with hierarchical structures.

Figure 5.19: BA model with $m = 2$ (i.e. $\langle k \rangle = 4$), $N = 5 \cdot 10^4$. (Bottom) Maximum
memory used by a node as a function of its degree. The dashed line is $\propto \sqrt{k}$. (Top) Average
memory used by nodes of degree $k$, for various values of $k$. The lines show the total memory
$N_w(k,t)$ used by nodes of degree $k$ at time $t$, normalized by the number $N_k$ of nodes of degree
$k$. The circles correspond to the bottom curve ($k = 5$) rescaled by $\sqrt{k/k_0}$, showing the
scaling of the peaks. Note that the values of $N_w(k,t)/N_k$ are averages over many runs that wash out
fluctuations and therefore correspond to smaller values than the extremal values observed for
$N_w^{max}(k)$.

On a hierarchical network, each node belongs to a given sub-hierarchical unit, and going from one
node to a node in another sub-unit requires following a hierarchical path. In the Naming Game, each
sub-unit can converge towards a local consensus, which makes the global consensus more difficult to
achieve (see Fig. 5.22-B). In other words, the dynamics slows down in the passage between different
levels of the hierarchy, with a behavior that resembles that observed in models of glassy dynamics
with traps [?] or in "hierarchical islands models" of diffusion in turbulent flows [253]. Note
however that, unlike the Naming Game, in these models there are real energy barriers obstructing the
dynamics. The results of numerical simulations on network models with a strong hierarchical
structure are striking: the Naming Game converges very slowly, the number of different words
decreasing as a power law of time (in Fig. 5.25 we have reported $N_w(t)/N - 1$ for several
hierarchical networks). The presence of hierarchy in a network is usually hard to quantify, but in
some cases, such as those of the networks represented in Fig. 5.23, it is implicitly introduced in
the generating procedure:

A. Regular or scale-free trees are clearly hierarchical structures (we have checked the behavior
of Cayley trees and BA scale-free trees obtained, as sketched in Fig. 5.23-A, by means of a
preferential attachment rule with $m = 1$ [17]);

B. For the DMS model with $m = 2$, one adds at each step a new node which is connected
to the extremities of a randomly chosen edge; thus the causal structure of the tree introduces a
hierarchy (see Fig. 5.23-B);

C. The deterministic scale-free networks are built starting with two nodes connected to a root. At
each temporal step $n$, two units (of $3^{n-1}$ nodes) identical to the network formed at the previous
step are added, and each of the bottom $2^n$ nodes is connected to the root [?] (Fig. 5.23-C);

D. The Random Apollonian Networks (RAN) [101, 254] are embedded in a two-dimensional plane.
One starts with a triangle; a node is added and connected to the three previous nodes; at each
step a new node is added in one of the existing triangles (chosen at random) and connected to
its three vertices, replacing the chosen triangle by three new smaller triangles (Fig. 5.23-D).

Figure 5.20: ER networks (left) and BA networks (right) with $N = 10^4$ agents and average degree
$\langle k \rangle = 4, 8, 16$. The increase of average degree leads to a larger memory used ($N_w$,
top) but a faster convergence. The maximum in the number of different words is not affected by the
change in the average degree (bottom).
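
Two of the generating procedures listed above, the DMS model with $m = 2$ (item B) and the Random
Apollonian Network (item D), are simple enough to be sketched directly. The code below is an
illustrative sketch only; the function names `dms_graph` and `random_apollonian` are hypothetical,
and the DMS growth is assumed to start from a single edge.

```python
import random


def dms_graph(n, seed=None):
    """DMS model with m = 2: each new node is linked to both ends of a random edge."""
    rng = random.Random(seed)
    edges = [(0, 1)]                      # assumption: start from a single edge
    for new in range(2, n):
        u, v = rng.choice(edges)
        edges += [(new, u), (new, v)]
    return edges


def random_apollonian(n, seed=None):
    """Random Apollonian Network: each new node splits a random triangle in three."""
    rng = random.Random(seed)
    edges = [(0, 1), (0, 2), (1, 2)]
    triangles = [(0, 1, 2)]
    for new in range(3, n):
        # Remove the chosen triangle and connect the new node to its three vertices.
        a, b, c = triangles.pop(rng.randrange(len(triangles)))
        edges += [(new, a), (new, b), (new, c)]
        triangles += [(a, b, new), (a, c, new), (b, c, new)]
    return edges
```

Both functions return plain edge lists, so the hierarchy introduced by the growth history can be
inspected directly: every node (beyond the seed) is attached only to nodes older than itself.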

In the particular case of tree structures, the power-law decay can be justified with a more precise
qualitative argument. From the viewpoint of the Naming Game dynamics, a tree is formed by two
ingredients: linear structures, on which the interfaces between clusters diffuse as in
one-dimensional systems, and branching points, at which the motion of interfaces slows down.
Following the arguments used in Section 5.3.1 and in Appendix [?], on a linear structure we can model
the motion of the interfaces between clusters of words as random walks. At branching points, however,
the interfaces can in principle interact with more than one cluster; thus the effective hopping
probability is decreased or, in other terms, the mean waiting time between two successive steps
increases. The average waiting time can be computed as the inverse of the stationary probability of
the local configurations of the interface (as for a classic escape-over-a-barrier problem [?]). In
principle, such probabilities can be obtained by solving a truncated Markov chain for the transition
rates of all possible moves of the interface at the branching point (as we have done for
one-dimensional systems in Appendix [?]). The computation is actually very demanding even in simple
situations such as that of an interface going across a node of degree 3 (Fig. 5.24 shows an example
containing some of the transitions one should take into account). However, from simple examples, we
expect that increasing the degree corresponds to a stronger effective pinning of the interfaces and
larger waiting times.
The above qualitative argument explains the diffusive behavior observed on regular trees, such as the
Cayley tree; on the other hand, on scale-free trees, such as the BA networks with $m = 1$, the behavior
is even slower, with a clearly subdiffusive exponent.

Figure 5.21: Effect of clustering on the behavior of the total number of words $N_w(t)$ and of the
number of different words $N_d(t)$ on random graphs (left) and scale-free networks (right) with
$N = 10^4$. The considered clustered random graphs (CER model, with clustering coefficient
proportional to $Q$) have been compared to standard ER graphs with equal average degree
($\langle k \rangle = 6$ and $10$). Scale-free networks have been generated using the mixed BA-DMS
model, in which the clustering coefficient is proportional to $1 - q$. In both networks higher
clustering leads to a smaller memory capacity required but a larger convergence time.

In the theory of random walks, subdiffusion can be associated with anomalously long waiting times
between successive steps [51, 146, 248, 253]. According to our picture, this is exactly what happens
in a scale-free tree, where the degree of the nodes is broadly distributed. The interfaces perform
random walks, but they may be suddenly pinned at some branching points, with waiting times that
depend on the degree of the nodes they try to by-pass. Consequently, the heterogeneity in the degree
induces heterogeneity in the waiting times and the corresponding subdiffusive behavior.
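
The link between broadly distributed waiting times and subdiffusion can be illustrated with a toy
continuous-time random walk. The sketch below is not a simulation of Naming Game interfaces: the
function name `msd_ctrw` is hypothetical, and the Pareto waiting-time distribution with tail
exponent `alpha` is an assumption chosen only because it has a heavy tail.

```python
import random


def msd_ctrw(t_max, alpha=None, walkers=500, seed=0):
    """Mean squared displacement at time t_max of one-dimensional random walkers.

    alpha=None : every step takes one time unit (ordinary diffusion).
    alpha=a    : waiting times are Pareto distributed, P(wait > w) = w**(-a), w >= 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walkers):
        t, x = 0.0, 0
        while True:
            wait = 1.0 if alpha is None else rng.paretovariate(alpha)
            if t + wait > t_max:
                break                     # the next jump would occur after t_max
            t += wait
            x += rng.choice((-1, 1))
        total += x * x
    return total / walkers
```

Comparing `msd_ctrw(t)` with `msd_ctrw(t, alpha=0.5)` for increasing `t` shows the mean squared
displacement of the walk with heavy-tailed waiting times growing much more slowly than linearly.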

Note that this argument holds only for trees, not for general hierarchical structures, whose local
dynamics is more complicated and prevents a detailed analysis.

Community structures - In contrast with other non-equilibrium models, such as those based
on zero-temperature Glauber dynamics or the Voter model [?, 52, 234, ?, ?], we do not find
any signature of metastable blocked states in any relevant topology with quenched disorder. Even
when, as happens in several cases, the total number of words displays a plateau whose length
increases with the system size, the number of different words decreases continuously, revealing
that the convergence is not triggered by fluctuations due to finite-size effects, but is the result
of an evolving self-organizing process. Such behavior makes the Naming Game a robust model of
self-coordinated communication in any structured population of agents.

Finding topological properties that ensure the existence of metastable states or blocked
configurations seems to be a non-trivial and intriguing task, which we try to investigate starting
from the following main remarks:

• the model displays slow coarsening dynamics whenever there is the possibility of cluster
formation, i.e. when topological constraints are strong enough to prevent word propagation;

• highly clustered regions and cliques of nodes rapidly find a local consensus;

• at the interfaces between clusters, an effective surface tension is generated;

• close to bottlenecks the surface tension may increase, causing the ordering process to slow down.

Figure 5.22: (A) Sketch of a shortcut connecting two distant regions and reinterpretation as a long
loop. (B) Naive representation of a typical cluster organization in a hierarchical structure.

According to this analysis, reasonable candidates for observing metastable states are networks with
strong community structures, i.e. networks composed of a certain number of internally highly
connected groups interconnected by a few links working as bridges. An example of a network with
strong community structure is represented in Fig. 5.26 (left): fully connected cliques composed of
$c = 4$ agents are interconnected by single edges. Figure 5.26 (right) reports the behavior of the
Naming Game on such a network, for different clique sizes $c$. Simulations show that not only the
total number of words, but also the number of different words has a plateau whose duration increases
with the size of the system. The number of different words in the plateau equals the number of
communities, while the corresponding total number of words per node is about one, proving the
existence of a real metastable state in which the system reaches a long-lasting multi-vocabulary
configuration. Indeed, each community reaches internal consensus, but the weak connections between
communities are not sufficient for words to propagate from one community to the other. The chosen
network certainly has an extremely strong community structure, but preliminary studies on real
networks of scientific collaborations give results that are in qualitative agreement with ours
(i.e. plateaus are observed).
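
A compact Python sketch of this numerical experiment is given below for illustration. It is not
the code used for Fig. 5.26: the names `clique_ring` and `naming_game` are hypothetical, the cliques
are assumed to be arranged on a ring (the text only states that they are joined by single edges),
and the dynamics is the minimal Naming Game with the direct strategy (random speaker, random
neighbor as hearer).

```python
import random


def clique_ring(n_cliques, c):
    """Cliques of c nodes; consecutive cliques are joined by one edge (ring assumed)."""
    adj = {i: set() for i in range(n_cliques * c)}
    for q in range(n_cliques):
        members = range(q * c, (q + 1) * c)
        for u in members:
            adj[u].update(v for v in members if v != u)
        u, v = q * c + c - 1, ((q + 1) % n_cliques) * c   # bridge to the next clique
        adj[u].add(v)
        adj[v].add(u)
    return adj


def naming_game(adj, steps, sample_every, seed=None):
    """Minimal Naming Game; returns a list of (t, N_w, N_d) samples."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    nbrs = {u: sorted(adj[u]) for u in nodes}
    inventory = {u: set() for u in nodes}
    next_word = 0
    history = []
    for t in range(1, steps + 1):
        speaker = rng.choice(nodes)
        hearer = rng.choice(nbrs[speaker])
        if not inventory[speaker]:
            inventory[speaker].add(next_word)      # invent a brand new word
            next_word += 1
        word = rng.choice(sorted(inventory[speaker]))
        if word in inventory[hearer]:
            inventory[speaker] = {word}            # success: both keep only this word
            inventory[hearer] = {word}
        else:
            inventory[hearer].add(word)            # failure: the hearer learns the word
        if t % sample_every == 0:
            n_w = sum(len(inv) for inv in inventory.values())
            n_d = len(set().union(*inventory.values()))
            history.append((t, n_w, n_d))
    return history
```

Plotting the third column of the returned samples against time, for networks of increasing size,
is the natural way to look for the plateau in the number of different words and to compare its
height with the number of cliques.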

More precisely, when a network contains communities of different sizes and the community structure
is not very strong, the corresponding curves $N_w(t)$ and $N_d(t)$ display a series of plateaus, with
sharp transitions in between. Several groups have recently put forward methods to distinguish
different levels of community structure in real and computer-generated networks by exploiting
dynamical processes evolving on them (see the review articles in Refs. [193, 93]). For instance,
Díaz-Guilera et al. [?] have used synchronization properties of non-linear oscillators (deployed on
the nodes of a network) in order to determine community structures at different levels of
resolution. Communities or groups of highly interconnected nodes are more likely to synchronize;
thus, looking at the temporal evolution of synchronization properties, it is possible to identify
communities at different scales. Similar analyses have been carried out by Bornholdt et al.
[211, 212] using Potts dynamics. In this case, the process leading to the community detection is the
same as for the Naming Game, i.e. a coarsening dynamics of clusters with surface tension at the
interfaces. Compared to Potts-based methods, the Naming Game has the relevant property that we do not
have to fix the number of states in advance, and the strength of the effective surface tension
depends on the local topological constraints. Future studies could modify the Naming Game model in
order to obtain a more appropriate tool for community detection.

Figure 5.23: Sketch of various generation procedures for models of hierarchical networks: (A) the
Barabási-Albert tree (BA model with $m = 1$); (B) DMS model with $m = 2$, i.e. a single new triangle
enters the system at each step; (C) deterministic hierarchical network proposed in Ref. [?]; (D)
random Apollonian network [101, 254].

Figure 5.24: A typical branching point at which cluster interfaces may be pinned. The increasing
number of possible transitions for larger degree causes the effective transition probability to
decrease.



5.4 Agents activity in heterogeneous populations 



The aim of this section, based on the material presented in Ref. [?], is to provide a detailed
statistical description of the internal dynamics of single agents, and of its relation with the
collective behavior of the Naming Game model.

The analysis of simulation results shows that the internal dynamics of an agent depends strongly
on its degree, highly connected agents being much more active than low-degree ones. The existence
of different activity patterns is reflected in the shape of the distribution of the number of words
stored in the inventory of a node, which turns out to depend on the level of heterogeneity of the
network. In homogeneous networks this distribution is exponential for all agents, while highly
connected agents in heterogeneous networks present a distribution with a clearly Gaussian tail
(half-normal distribution). From the point of view of the evolution rule, this result shows that the
role of the memory differs depending on the connectivity properties of single agents. The effects on
the dynamics are clearer if we consider a closely related quantity, the cumulative distribution of
the waiting times (or survival probability) between two consecutive successful interactions, i.e.
between two decisions taken by the same agent. Indeed, an exponential waiting time distribution is
the signature of Poissonian dynamics, while our results show that the decision process associated
with the internal activity of the agents is intrinsically non-Poissonian, and turns out to be
Poissonian only in the special case of a homogeneous network. This feature is completely new in
non-equilibrium models of social interactions, in which the interaction rules are usually defined in
such a way that the agents take decisions at an approximately constant rate.

Apart from the intrinsic interest of non-Poissonian individual dynamics, our findings help to
understand the strong convergence towards the absorbing state that the model exhibits in all
small-world structures, independently of the degree heterogeneity.

Figure 5.25: Power-law decrease in the total number of words $N_w(t)$ for the Naming Game on several
hierarchical networks: the Barabási-Albert scale-free tree (full line), the DMS model with $m = 2$
(dotted line), the deterministic scale-free network of Barabási, Ravasz and Vicsek (dot-dashed line),
the Random Apollonian Network (dot-dot-dashed line). The behavior of hierarchical networks is
compared with the mean-field like behavior of a Barabási-Albert network with $m = 4$ (dashed line).
Note that all hierarchical networks show diffusive coarsening except for the scale-free tree, whose
behavior is subdiffusive.



5.4.1 Numerical results on agents activity 

By means of numerical simulations, we characterize the activity patterns of an agent, focusing on
the dynamics of its inventory size, i.e. the number of words (opinions, states, etc.) stored by the
agent. In particular, our analysis is conceived for those topologies which present mean-field like
dynamics (e.g. complete graph, homogeneous and heterogeneous random graphs, high-dimensional
lattices, etc.), where we cannot clearly identify a coarsening process leading to the nucleation and
growth of clusters containing quiescent agents. In other topologies, such as low-dimensional
lattices, the agents' internal activity is biased by the limited number of words locally available
(Section 5.3.1). An example of the different activity patterns in different topologies is reported in
Fig. 5.27. In homogeneous networks (e.g. ER random graphs), the nodes have similar topological
properties, thus their activity patterns are very similar as well. For heterogeneous networks,
instead, highly connected nodes (hubs) play a different role in the dynamics compared to low degree
nodes. The hubs are much more active, and their activity is decisive in driving the system to a rapid
collective agreement.

Figure 5.26: (Left) A network with strong community structure. (Right) Metastable states in networks
with strong community structure. Each community is composed of $c$ nodes so that there are $N/c$
communities.

More precisely, hubs show opposite behaviors depending on the pair selection strategy. As already
pointed out in Section 5.3.3, the asymmetry of the NG interaction rules becomes relevant in the case
of heterogeneous networks. In the direct Naming Game, which most naturally describes realistic
speaker-hearer interactions, the speaker is chosen with probability $p_k$ (where $p_k$ is the degree
distribution of the network), while the hearer is chosen with probability $k p_k/\langle k \rangle$.
According to this selection criterion, the high-degree nodes are preferentially chosen as hearers.
Using the opposite strategy, called reverse Naming Game, the hubs are preferentially selected as
speakers, whereas the neutral strategy ensures that the roles of speaker and hearer are assigned with
equal probability. Figure 5.28 shows that the reverse strategy produces completely different activity
patterns compared to the direct one, with a rather low variability and the absence of high spikes in
the number of words. No significant difference between hubs and low-degree nodes is visible. The
reason is that the inventory size increases because of a failure only if the node is playing as
hearer; speakers never add states to their inventories. Hence, agents that preferentially play as
hearers tend to be more unstable, amplifying the number of states in the system. Using the direct
strategy in heterogeneous networks favors the choice of the hubs as hearers, the degree of the hubs
being orders of magnitude larger than the average degree. This is the reason for the large number of
states stored in the inventory of the hubs for the direct strategy.
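
The three pair selection strategies can be made concrete with a short Python sketch. The function
name `pick_pair` and its signature are hypothetical, and the neutral strategy is implemented here,
as an assumption, by drawing a random edge and assigning the two roles with a fair coin.

```python
import random


def pick_pair(adj, edges, strategy, rng):
    """Return (speaker, hearer) for the given pair selection strategy.

    adj   : list of neighbor lists, adj[i] = neighbors of node i
    edges : list of (u, v) pairs, used only by the neutral strategy
    """
    if strategy == "neutral":
        # Assumption: random edge, roles assigned with equal probability.
        u, v = rng.choice(edges)
        return (u, v) if rng.random() < 0.5 else (v, u)
    first = rng.randrange(len(adj))       # uniform node: its degree follows p_k
    second = rng.choice(adj[first])       # random neighbor: degree follows k p_k / <k>
    if strategy == "direct":
        return first, second              # hubs are preferentially hearers
    if strategy == "reverse":
        return second, first              # hubs are preferentially speakers
    raise ValueError("unknown strategy: " + strategy)
```

On a star graph the effect is extreme: with the direct strategy the central node is almost always
the hearer, with the reverse strategy it is almost always the speaker.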

A first quantity that clearly points out differences in the activity of nodes depending both on
their degree and on the topological structure of the network is the probability distribution
$\mathcal{P}_n(k|t)$ of the number $n$ of states stored in the inventory of nodes of degree $k$ at
time $t$. This means that the distribution is averaged over the class of nodes of given degree.
Actually, as shown in Appendix D, this quantity depends only parametrically on the time $t$.
Fig. 5.29 (top) reports $\mathcal{P}_n(k|t)$ for the case of highly connected nodes in a
heterogeneous network (the Barabási-Albert network), whereas the same data for nodes of typical
degree in a homogeneous network (the Erdős-Rényi random graph) are displayed in the bottom panel. In
homogeneous networks the shape of the distribution does not actually depend on the degree of the
node, since all nodes have degree approximately equal to the average degree $\langle k \rangle$. In
heterogeneous networks, instead, a deep difference exists between the behavior of low and high degree
nodes. Low degree nodes have no room to reach high values of $n$, thus their distribution has a very
rapid decay (data not shown); for high degree nodes, on the contrary, the distributions extend over
more than one decade and their form is much clearer.

Apart from the behavior of low degree nodes, it is clear that the functional form of the distribution
$\mathcal{P}_n(k|t)$ is different in homogeneous and heterogeneous networks. In homogeneous networks
the distribution is exponential, while in heterogeneous networks it decays faster, and is well
approximated by a half-normal distribution.

Figure 5.27: Examples of temporal series of the number of states at a given node. (Top) Series from
a BA network with $N = 10^4$ nodes and $\langle k \rangle = 10$, for nodes of high degree (e.g.
$k = 414$) and low degree (e.g. $k = 10$). (Bottom) Series for nodes in an ER random graph
($N = 10^4$, $\langle k \rangle = 50$) and in a one-dimensional ring ($k = 2$).

Another interesting quantity is the probability distribution $\mathcal{Q}_n(k|t)$ that an agent of
degree $k$ achieves a success in an interaction occurring when it has $n$ states in its inventory,
i.e. the distribution of the value at which the inventory is abruptly reset to 1. This quantity has
an exponential shape in homogeneous networks (Fig. 5.30, bottom) for sufficiently high average
degree, and a Weibull-like shape (Fig. 5.30, top) for the case of heterogeneous networks
(high-degree nodes). The relation between $\mathcal{Q}_n(k|t)$ and $\mathcal{P}_n(k|t)$ is
straightforward: the probability distribution $\mathcal{Q}_n(k|t)$ of the level $n$ at which a
successful interaction occurs is the product of the probability of having $n$ states and the
conditional probability $W_k(n \to 1|t)$ that an agent (of degree $k$) finds at time $t$ a
(temporary) agreement when it has $n$ states; i.e.
$\mathcal{Q}_n(k|t) = W_k(n \to 1|t)\, \mathcal{P}_n(k|t)$.
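
As a worked illustration of this product relation, one can insert the functional forms reported in
this section: an exponential or half-normal $\mathcal{P}_n$, together with a success probability
that is either independent of $n$ or proportional to $n$ (the two cases discussed in Section 5.4.2).
The constants $a$, $\sigma$, $w_0$ and $w_1$ below are hypothetical fitting parameters introduced
only for this example.

```latex
% Homogeneous networks: exponential P_n and constant success probability
\mathcal{P}_n \propto e^{-a n}, \qquad W_k(n \to 1) \simeq w_0
\quad\Longrightarrow\quad
\mathcal{Q}_n \propto e^{-a n}

% Hubs of heterogeneous networks: half-normal P_n and success probability linear in n
\mathcal{P}_n \propto e^{-n^2/2\sigma^2}, \qquad W_k(n \to 1) \simeq w_1\, n
\quad\Longrightarrow\quad
\mathcal{Q}_n \propto n\, e^{-n^2/2\sigma^2}
```

The second form is a Weibull density with shape parameter 2 and scale $\sqrt{2}\,\sigma$, which is
consistent with the Weibull-like shape observed for high-degree nodes.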



5.4.2 Theoretical interpretation and future work 

In Appendix D we discuss a master equation approach for the jump process associated with the
dynamics of single agents, by means of which it is possible to derive the correct expression for the
distribution $\mathcal{P}_n(k|t)$ and, consequently, that of $\mathcal{Q}_n(k|t)$. We recover here
the same result using a naive argument, based on the concept of survival probability in renewal
processes, which is useful to clarify the role of non-Poissonian dynamics in relation to other types
of non-equilibrium statistical models.
A renewal process [82] is a stochastic process characterized by a series of recurrent events that are
separated by waiting times $\{\tau_i\}$. The waiting times are mutually independent random variables
with a common distribution $\mathcal{F}(\tau)$. We call survival probability $\mathcal{F}_>(\tau)$
the probability that the renewal event occurs after a waiting time at least equal to $\tau$.

In physics, stochastic processes are usually coarse-grained models of some natural phenomenon, and
the observed waiting time statistics is the result of some peculiar properties of the underlying
phenomenon. For instance, we observe power-law waiting time distributions in many natural phenomena
and in the models used to study or reproduce these phenomena (e.g. in solar flares [249, ?],
financial markets [218, 169], anomalous transport [253], earthquakes [197, ?], etc.). Distributions
with Weibull (and Gaussian) tails are frequent in more general problems of queuing theory [121] and
survival analysis [114]. On the contrary, models of statistical mechanics traditionally used in
opinion dynamics show exponentially shaped waiting time distributions (the signature of Poissonian
dynamics).

Figure 5.28: Example of temporal series for node activity in a Reverse Naming Game on a BA
network with $\langle k \rangle = 10$. The activity of the hubs (left) is very low (they are
preferentially chosen as speakers) while that of low degree nodes (right) is the same as in the
direct NG.

Figure 5.29: Distribution of the number of words for a class of high-degree nodes in a BA network
(top) with $\langle k \rangle = 10$ and for average degree nodes in an ER network (bottom) with
$\langle k \rangle = 50$. Both networks have $N = 10^4$ nodes. The distributions have been computed
during the re-organizational phase after the peak in the number of words.

All these different statistics are based on the factorization property of the corresponding
stochastic (renewal) processes. Let us denote by $h(\tau)$ the hazard function of the process, i.e.
the rate of occurrence of the renewal event at a (waiting) time $\tau$. The survival probability,
i.e. the probability that the event occurs at a time at least equal to $\tau$, satisfies the
following recursion equation,

$$\mathcal{F}_>(\tau + 1) = \mathcal{F}_>(\tau)\,[1 - h(\tau)] , \tag{5.5}$$

whose general solution is

$$\mathcal{F}_>(\tau) = \frac{\prod_{\tau'=1}^{\tau-1} [1 - h(\tau')]}{\sum_{\tau''=1}^{\infty} \prod_{\tau'=1}^{\tau''-1} [1 - h(\tau')]} . \tag{5.6}$$
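
The recursion (5.5) is straightforward to iterate numerically. The sketch below is illustrative
only: the function name `survival_from_hazard` is hypothetical, the iteration starts from the
unnormalized value $\mathcal{F}_>(1) = 1$, and the normalization of Eq. (5.6) is left to the caller.

```python
import math


def survival_from_hazard(hazard, t_max):
    """Iterate F_>(tau + 1) = F_>(tau) * (1 - h(tau)) from F_>(1) = 1.

    Returns a list `surv` with surv[i] = F_>(i + 1), unnormalized.
    """
    surv = [1.0]
    for tau in range(1, t_max):
        surv.append(surv[-1] * (1.0 - hazard(tau)))
    return surv


# Constant hazard: geometric (discrete exponential) survival probability.
constant = survival_from_hazard(lambda tau: 0.1, 30)

# Hazard growing linearly with the waiting time: Gaussian-like decay.
C = 1e-3   # hypothetical value, small enough that C * tau stays below 1
linear = survival_from_hazard(lambda tau: C * tau, 30)
```

Dividing every entry by `sum(surv)` (for a `t_max` large enough that the tail is negligible) gives
the normalized form of Eq. (5.6).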



In the particular case of the Naming Game, a renewal event is identified with a successful
interaction, which brings the inventory size of the node back to 1. The pair selection mechanism is
purely Poissonian, thus an agent interacts at a precise constant rate (that is $p_k$ or
$k p_k/\langle k \rangle$, depending on whether it plays as speaker or as hearer). This Poissonian
external signal can be regarded as a discrete clock for the internal activity of the nodes. Hence,
apart from a time rescaling, the dynamics of the inventory size, described in the previous section
and in Appendix D, gives a good approximation of the waiting time statistics related to the renewal
events. The distributions $\mathcal{P}_n(k|t)$ and $\mathcal{Q}_n(k|t)$ correspond respectively to
$\mathcal{F}_>(\tau)$ and $\mathcal{F}(\tau)$. The hazard function thus has the same functional form
as the success probability $W_k(n \to 1|t)$, with the waiting time $\tau$ in place of the number of
words $n$.

Figure 5.30: Probability distribution $\mathcal{Q}_n(k|t)$ of the number of states $n$ at which an
agent gets a success, i.e. the inventory is reset to 1 state, for highly connected nodes in
heterogeneous networks (top) and nodes of typical degree in homogeneous networks (bottom). Same
parameters as in Fig. 5.29.

In Appendix D (and in Ref. [92]), we show that the success probability $W_k(n \to 1|t)$ assumes
different forms depending on the underlying topology. In homogeneous networks, it turns out to be
almost independent of the number $n$ of words stored in the inventory of the node; in heterogeneous
networks it is instead linearly proportional to $n$ (for nodes of sufficiently high degree $k$).
Inserting in Eq. (5.6) hazard functions with these functional forms, we can compute the corresponding
expression for $\mathcal{F}_>(\tau)$, obtaining an approximate expression also for
$\mathcal{P}_n(k|t)$.

Let us first consider a constant hazard function: from Eq. (5.6) the corresponding survival
probability distribution is exponential, and consequently $\mathcal{P}_n(k|t)$ also decreases
exponentially with $n$. On the other hand, when the hazard function grows linearly with the waiting
time, $h(\tau) \simeq C\tau$ with normalization constant $C$, we get
$\mathcal{F}_>(\tau) \propto \exp(-C\tau^2/2)$. This simple argument provides an explanation of the
Gaussian decrease of the distribution $\mathcal{P}_n(k|t)$ observed for highly connected nodes of
heterogeneous networks. (See Appendix D for a more rigorous approach.)
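
The two limiting cases can be worked out explicitly from the recursion (5.5). The short derivation
below is a sketch under the assumptions that the hazard is small, $h(\tau) \ll 1$, so that
$\ln(1 - h) \simeq -h$, and that the overall normalization is ignored; $h_0$ and $C$ denote the
(hypothetical) constants of the constant and linear hazard functions.

```latex
\ln \mathcal{F}_>(\tau)
  = \sum_{\tau'=1}^{\tau-1} \ln\bigl[1 - h(\tau')\bigr]
  \simeq -\sum_{\tau'=1}^{\tau-1} h(\tau')

% constant hazard h(\tau') = h_0
\mathcal{F}_>(\tau) \simeq e^{-h_0 (\tau - 1)} \propto e^{-h_0 \tau}

% linear hazard h(\tau') = C \tau'
\mathcal{F}_>(\tau) \simeq \exp\Bigl[-\tfrac{C}{2}\,\tau(\tau - 1)\Bigr]
  \simeq e^{-C\tau^2/2} \qquad (\tau \gg 1)
```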

A further remark concerns the relation between waiting times and Poissonian processes. In general,
the agent-based models studied in statistical physics are spin-like models, in which an individual is
endowed with a variable taking values in a given set, each value corresponding to a different state.
In such systems, single-agent dynamics is intrinsically Poissonian. For instance, in systems evolving
by means of Glauber dynamics, spin flips at a site occur independently of one another, i.e. they are
Poissonian events with (Boltzmann) rate $\lambda \propto e^{-\beta \Delta H}$. The corresponding
waiting time distribution (and survival probability) is exponential (as $\lambda \exp(-\lambda t)$).
In the present model, on the contrary, the internal activity of an agent is modeled in order to
reproduce a sort of decision process based on information storage, and the waiting time between
successive decisions turns out to depend strongly on the underlying topology. This seems to be
important for the global behavior: in the direct strategy, the hubs drive the system to a fast
convergence to the absorbing state, as a result of the trade-off between their larger activity and
their stronger inclination to reach an agreement (due to their internal memory patterns).

The first step toward a better comprehension of the role of non-Poissonian dynamics is to compare
these results, and the scaling properties of the convergence time, with those for the reverse Naming
Game, in which the hubs have exponential waiting time distributions.

Finally, waiting time statistics is also used as a measure of the criticality in the behavior of
physical systems, individuals and natural phenomena, in particular in relation to extreme events
[15]. Waiting time distributions with heavy tails are a signature of the absence of a characteristic
scale on which the events occur. For example, the theory that justifies the observation of power-law
distributed waiting times between aftershocks in earthquakes is based on the Omori law [197],
corresponding to a hazard function inversely proportional to the waiting time. Indeed, a success rate
decreasing with time is necessary to get a broad waiting time distribution.
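
The connection between a hazard function inversely proportional to the waiting time and a power-law
distribution follows from the same recursion (5.5). The lines below are a sketch under the
assumption $h(\tau) = \alpha/\tau$ with $\alpha/\tau \ll 1$, where the exponent $\alpha$ is a
symbol introduced only for this example.

```latex
\ln \mathcal{F}_>(\tau)
  \simeq -\sum_{\tau'=1}^{\tau-1} \frac{\alpha}{\tau'}
  \simeq -\alpha \ln \tau
\quad\Longrightarrow\quad
\mathcal{F}_>(\tau) \propto \tau^{-\alpha}
```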

It would be interesting to modify the interaction rules of the Naming Game model (and thus the
hazard function) in order to change the shape of the waiting time distribution and, in particular,
to obtain a power-law one. Such a situation would correspond to a critical decision process, in which
agents might store a very large number of words, with an a priori unlimited memory requirement.


5.5 Conclusions

In the last part of this thesis, we have investigated a model of social dynamics, the Naming Game,
that can be considered an example of a new and interesting class of dynamical processes, conceived to
describe the onset of global agreement in a population of individuals by means of pairwise
negotiation interactions and memory-driven decision processes. Compared with well-known models of
social dynamics borrowed from statistical mechanics (e.g. majority rule models, the Voter model,
etc.), the Naming Game takes into account more realistic characteristics of social interactions,
while retaining the simplicity that was a virtue of the former.
Thanks to this mixture of ingredients, the Naming Game (and possible variants of the model) seems
to be more appropriate than previous models for the study of heterogeneous populations of agents,
such as social networks, since the dynamics is the result of a strong interplay between topological
features and the internal properties of the agents. We have focused our attention on two aspects:

• the dynamical features of the Naming Game on different topologies; 

• single agents' internal activity and its relation with the global behavior.

In order to understand the behavior of the model on complex networks, we have analyzed the impact
of the different topological structures, starting from the rather unrealistic cases of the complete
graph (mean-field) and the one-dimensional system. These cases nevertheless turn out to be valuable for the
comprehension of more complex topologies. They are, indeed, almost completely analytically solvable,
and display two opposite behaviors:

• the mean-field model is characterized by an initial super-spreading of words throughout the
system, whose maximum is reached in a time $t_{max} \sim O(N^{3/2})$ and corresponds to a state in
which each single agent possesses $O(\sqrt{N})$ words; then, a very fast convergence (i.e. faster than
exponential) takes place, leading the system to the global consensus in a time $t_{conv} \sim O(N^{3/2})$;

• in the one-dimensional model, agents immediately find a local consensus: many clusters of neighboring
agents with a common unique word start to grow, competing in a coarsening process driven
by the diffusion and coalescence of the interfaces. Consequently, the maximum total number
of words, reached very quickly in $O(N)$ steps, scales as $O(N)$ as well, but the global agreement
requires a time $t_{conv} \sim O(N^3)$.
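
The mean-field quantities quoted in the first item (time of the memory peak, height of the peak, convergence time) can be measured with a short simulation. The sketch below is illustrative only: it assumes the usual minimal Naming Game rules, restated in the docstring, and the function name `naming_game_complete_graph` and its return convention are hypothetical. Time is counted in pairwise interactions.

```python
import random

def naming_game_complete_graph(n_agents, seed=0, max_steps=10_000_000):
    """Simulate the minimal Naming Game on a complete graph.

    Assumed rules: a random speaker utters a random word from its inventory
    (inventing a new word if the inventory is empty); if the hearer knows the
    word, both keep only that word, otherwise the hearer adds it.
    Returns (t_conv, t_max, max_total_words).
    """
    rng = random.Random(seed)
    inventories = [set() for _ in range(n_agents)]
    next_word = 0          # label of the next invented word
    total_words = 0        # sum of all inventory sizes
    max_total_words, t_max = 0, 0

    for t in range(1, max_steps + 1):
        speaker, hearer = rng.sample(range(n_agents), 2)
        if not inventories[speaker]:
            inventories[speaker].add(next_word)
            next_word += 1
            total_words += 1
        word = rng.choice(sorted(inventories[speaker]))
        if word in inventories[hearer]:
            # success: both inventories collapse to the transmitted word
            total_words -= len(inventories[speaker]) + len(inventories[hearer]) - 2
            inventories[speaker] = {word}
            inventories[hearer] = {word}
        else:
            # failure: the hearer learns the word
            inventories[hearer].add(word)
            total_words += 1
        if total_words > max_total_words:
            max_total_words, t_max = total_words, t
        # consensus: all inventories identical and of size one
        if total_words == n_agents and all(inv == inventories[0] for inv in inventories):
            return t, t_max, max_total_words
    raise RuntimeError("no consensus reached within max_steps")

if __name__ == "__main__":
    for n in (50, 100, 200):
        t_conv, t_max, w_max = naming_game_complete_graph(n, seed=1)
        print(n, t_conv, t_max, w_max)
```

Averaging `t_conv` and `max_total_words` over many seeds and plotting them against N on a log-log scale is one way to compare a simulation with the exponents quoted above; a single run per size is too noisy for that purpose.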

The second step towards the comprehension of the Naming Game dynamics has been provided by the
study of the Watts-Strogatz model [247]. The networks generated by this model are characterized by
a tunable parameter (the rewiring probability) that allows one to interpolate between a one-dimensional
regular lattice and a homogeneous random graph. For non-zero rewiring probability, the model has
the small-world property, i.e. different regions of the network are connected by shortcuts, so that the
average distance between nodes scales logarithmically with the network size. After an initial phase
during which words are created and small local clusters appear, the small-world property ensures their
propagation beyond the local scale, boosting the spreading process (contrary to what happens in
low-dimensional lattices, where the spreading of words is purely diffusive).

The same acceleration of the dynamics is then observed in many other networks sharing the small-world
property, suggesting that this property is sufficient to recover the high temporal efficiency observed in the
mean-field system. For both the homogeneous and heterogeneous network models, we get a scaling law
for the convergence time $t_{conv}$ with the size $N$ of the system of the type $t_{conv} \sim N^{\beta_{SW}}$, with exponent
approximately $\beta_{SW} \simeq 1.4$. The discrepancy with the mean-field exponent ($\beta_{MF} = 1.5$) may be due
to logarithmic corrections. Moreover, small-world networks have higher memory efficiency than the
mean-field model, since the peak in the total number of words scales only linearly with the size $N$.
This is due to the fact that these networks are sparse (their average degree $\langle k \rangle$ is small compared to
$N$).

Nonetheless, a detailed analysis reveals distinct dynamical patterns on homogeneous and
heterogeneous networks. In homogeneous networks all nodes have a similar neighborhood and therefore
a similar dynamical evolution, while in heterogeneous networks classes of nodes with different degree
play different roles in the evolution of the Game. High-degree nodes, indeed, are more likely to be chosen as
hearers (in the direct Naming Game). At the beginning, low-degree nodes are much more involved in
the process of word generation than the hubs; local consensus is easily reached and a large number of
locally stable different words comes into contact with higher-degree nodes. The latter start to accumulate
a large number of words in their inventories, acting as "super-spreaders" of names towards less connected
agents and finally driving the convergence. From this viewpoint, the convergence
pattern of the Naming Game on heterogeneous complex networks presents some similarities with the more
widely studied epidemic spreading phenomena [33].

The shape of the degree distribution and the scaling of the average distance are not the only topolog- 
ical properties that determine the Naming Game dynamics; for this reason, we have investigated the 
effects of a number of other quantities. 

On both homogeneous and heterogeneous networks, an increase in the average degree induces a larger 
memory peak and a faster convergence, while the growth of the clustering coefficient has a completely 
opposite effect. This is particularly important in social networks that are usually characterized by a 
large level of cohesiveness. 

A very striking result concerns the convergence of the Naming Game on networks with a well-defined
hierarchical organization or community structure. On generic hierarchical networks, and particularly
on trees, the process leading to the final agreement is very slow, governed by a diffusive or even subdiffusive
coarsening. We have identified the origin of this behavior in the existence of a non-negligible
surface tension between different hierarchical levels. A similar behavior is due to the presence of a
community structure: each community quickly finds an internal agreement, but a cluster cannot easily
expand outside its own community since the interfaces get pinned on the few bridges interconnecting
different communities. Therefore, in this case the curve of the number of different words $N_d(t)$ is
not characterized by a power-law decay, but by a series of plateaus of different size corresponding to
different levels of refinement in the community organization of the network. When the duration of a
plateau (in $N_d(t)$) increases with the size of the network, we say that the system is in a metastable
state.

Even in the presence of metastable states, if we wait sufficiently long, the system will converge to the
absorbing state, meaning that the Naming Game model is a strongly converging dynamical rule. The
origin of this behavior resides in the memory-based decision rule. We have investigated its implications
both at a local and at a global level.

At a local level we have focused on the internal activity of the agents in different topologies, gaining
deeper insight into the mechanisms governing the decision processes (see Section 5.4). In particular,
we have found that the single-agent dynamics is intrinsically non-Poissonian, resulting in a stronger
tendency to take a decision (and then to converge) for the high-degree nodes. This tendency balances
the instability due to the fact that high-degree nodes, being in contact with many different words, are
more exposed to perturbations in the dynamics.

The role of memory in the global behavior of the system is that of generating an effective surface
tension (see Section 5.3.1) that is responsible for the coarsening of clusters. In fact, the surface tension
associated with the coarsening process of clusters of agents with a unique word is strong enough
to ensure the convergence of the Naming Game in any dimension, but sufficiently weak to prevent
the system from getting blocked in metastable states. From this point of view, the Naming Game is similar to
a low-temperature Potts model, but without the typical bulk noise due to an externally imposed
temperature-like parameter.

Note that this form of surface tension does not have an energetic microscopic nature, as in the Ising
model, but is due to the introduction of temporal correlations in the decision process (i.e. the
introduction of memory), similar to what happens in diffusion-limited aggregation when a noise-reduction
mechanism is taken into account. Some preliminary results on the comparison between cluster
dynamics in models with memory (like the Naming Game) and other lattice spin models seem to
corroborate this picture, which will be developed in future work.

In summary, like other models of opinion formation, the Naming Game shows a non-equilibrium
dynamical evolution from a disordered state to a state of global agreement. However, compared
with most opinion models, in which the agents may accept or refuse to conform to the opinion of
someone else, the Naming Game gives more importance to the bilateral negotiation process between
pairs of agents. For this reason, the Naming Game should be regarded as a model for the emergence
of a globally accepted linguistic convention or, in other terms, the establishment of a self-organized
communication system; but it can also be reasonably used to describe opinion formation and other
social polarization phenomena. The main novelty certainly resides in the introduction of pairwise
interactions endowed with memory and feedback, which make the Naming Game phenomenology closer
to that observed in real systems, in particular when we consider the system embedded in a complex
network topology.






Chapter 6 



General Conclusions and Outlook 



In this thesis, I have studied both numerically and analytically some structural and dynamical aspects
of complex networks. From the point of view of a theoretical physicist, the problems that
phenomenological observations on complex networks are raising are undoubtedly exciting, because of
the possibility of getting striking results by applying rather simple statistical physics approaches. As a
consequence of the small-world property, indeed, the general behavior of complex networks and of the
dynamical processes taking place on them can be fairly well described using mean-field arguments. On the
other hand, when mean-field-like methods are not applicable, or the picture we obtain from them is
not satisfactory, in most cases the only possible approach is that of numerical simulations.
This picture corresponds in general to the research approach followed in this thesis. For example,
in the case of the exploration of networks, we have provided a mean-field analysis that gives very good
results at a qualitative level, but for any quantitative characterization of network sampling the use of
numerical simulations cannot be avoided. In the interdisciplinary field of complex networks, however,
the quantitative aspect is very important, since the original problems are usually closely related to
applications and theoretical results have to be compared with the abundance of phenomenological
data. Thus, future work will aim to improve these two approaches separately. First, on
the numerical side, it is worthwhile to extend the investigations of this thesis to more realistic models of
Internet mapping, in order to verify the reliability of our results when the condition of shortest-path
probes is relaxed. Preliminary results have been obtained following two different approaches: one
uses a model in which local path inflations are introduced (i.e. distortions of the shortest
path that should reproduce the effect of traffic and policies [99]); the other considers traceroute
probing by means of weighted shortest paths, in which the weights are randomly distributed on the
edges (we have only studied low-disorder regimes, but the strong-disorder limit should also provide
interesting results [150]). In both cases, the quantitative estimation of the relevant topological quantities
does not seem to change considerably from the results presented here.

From an analytical point of view, I am currently interested in understanding the origins of the biases
introduced by tree-like explorations in relation to the causal structure of the spanning trees
generated in the sampling or, more generally, to the framework of hidden variable models. It has been
shown [38, 237] that scale-free topological properties emerge naturally when the networks are endowed
with a causal structure; according to this picture, sampling biases could be the natural result of the
systematic introduction of causality in the network's topology.

Of course, theoretical improvements of the mean-field approach are possible, for instance taking into 
account correlations in the expressions for the node and edge discovery probability. 
The most promising possible application of the work presented in Chapter 3 is however the definition
of statistical estimators able to correct the biases due to the sampling process. We have successfully
introduced estimators for the number of nodes in the Internet, but ongoing work aims to define
unbiased estimators for other quantities as well (e.g. the number of edges).

A similar twofold approach holds for the subjects discussed in Chapter In order to achieve a more 
satisfactory comprehension of which mechanisms are responsible for the non-trivial structural and
functional organization of real complex networks, such as the world-wide airport network, we need
first to refine the analysis of the real data, selecting those quantities that best pinpoint the functional
and economic dimension of the system; on the other hand, we have to improve the current models of
weighted growing networks to produce more realistic effects.

For the inhomogeneous spreading on complex networks, whose theory is treated in Chapter 01 and 
especially in Appendix B, the situation is the opposite one: we have a good understanding of the
theoretical description of the process, but we lack phenomenological data (e.g. those
regarding the rates of infection of a real virus on the Internet when the heterogeneity of the nodes as well
as their functional properties are taken into account). It would be interesting to retrieve real data
on the level of functional inhomogeneity in infrastructure networks and verify whether the corresponding
spreading properties can be analyzed within the theoretical framework provided here.
The Naming Game is a rather recent topic of research, thus many aspects of the dynamics of the
model are not completely clear. One of the most interesting phenomena displayed by this model is
the presence of an effective surface tension that is comparable with that of a low-temperature Potts
model, even if the pairwise evolution rule looks more similar to that of the Voter model [159], in which
the surface tension is absent. We have reason to think that the surface tension is a consequence of the
presence of memory in the local rule; future work will therefore aim to show, using a simplified
model, that the presence of memory in the nodes is sufficient to produce a coarsening dynamics,
in analogy with some noise-reduction techniques studied in surface growth problems [175].
Moreover, my personal opinion is that a further simplified model of the Naming Game, conceived in such
a way as to retain all its relevant properties, could make it possible to study analytically the dynamics on the whole
class of mean-field-like models, possibly elucidating the relation between the small-world property (and
other topological properties) and the scaling of the convergence time. It would also be interesting
to add some external source of noise to the system, possibly coupled with the internal activity of the
agents, in order to study whether a phase transition emerges toward a non-trivial state in which agents do not reach
a consensus.

In conclusion, the study of dynamical phenomena on complex networks is a fascinating subject
of research that is expected to become more and more popular in the coming years, because of the
large amount of data that is still not available on real dynamical processes, the possibility of
theoretical modeling by means of known statistical physics approaches, and the considerable number
of issues to which the analytical and numerical analysis of simple models can be successfully applied.

Appendix A

Generating Functions in 
Percolation Problems 



In this appendix, we provide a brief introduction to the formalism of generating functions in the study
of percolation on random graphs. We consider infinite graphs without isolated vertices, self-edges or
multiple edges. The generating function for the degree distribution $p_k$ of a randomly chosen vertex is

$$G_0(x) = \sum_{k=1}^{\infty} p_k x^k , \qquad \text{(A.1)}$$

with $G_0(1) = 1$ and $\langle k \rangle = \sum_k k p_k = G_0'(1)$. Similarly, the generating function for the probability
that a randomly chosen edge leads to a vertex of given degree is

$$G_1(x) = \frac{\sum_k k p_k x^{k-1}}{\sum_k k p_k} = \frac{G_0'(x)}{G_0'(1)} . \qquad \text{(A.2)}$$

A useful property is that the probability distribution and its moments can be computed by simple
differentiation of the corresponding generating function.
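
As a concrete illustration of these definitions, the sketch below evaluates the vertex and edge generating functions for a small, made-up degree distribution stored as a dictionary `{k: p_k}`. The function names and the example distribution are assumptions introduced only for this illustration.

```python
def G0(x, pk):
    """Vertex generating function: sum_k p_k x^k."""
    return sum(p * x ** k for k, p in pk.items())

def G0_prime(x, pk):
    """First derivative of G0: sum_k k p_k x^(k-1)."""
    return sum(k * p * x ** (k - 1) for k, p in pk.items() if k > 0)

def G1(x, pk):
    """Edge generating function: G0'(x) / G0'(1)."""
    return G0_prime(x, pk) / G0_prime(1.0, pk)

def mean_degree(pk):
    """Average degree <k>, obtained as G0'(1)."""
    return G0_prime(1.0, pk)

if __name__ == "__main__":
    pk = {1: 0.5, 2: 0.3, 3: 0.2}   # hypothetical degree distribution, no isolated vertices
    print("G0(1) =", G0(1.0, pk))    # normalization
    print("<k>   =", mean_degree(pk))
    print("G1(1) =", G1(1.0, pk))    # normalization of the edge distribution
```

The value `G1(0.0, pk)` equals p_1 / ⟨k⟩, the probability that a randomly chosen edge leads to a vertex with no other edges, which shows how a single evaluation of the function isolates one term of the distribution.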

If we call $q_k$ the probability that a vertex of degree $k$ is occupied (or the node traversing probability,
if regarded as a spreading phenomenon), the probability that, choosing a vertex at random, we pick
an occupied vertex of degree $k$ is the product of the probabilities of two independent events, i.e. $p_k q_k$.
Repeating the same operation with the edges, we need the probability that the randomly chosen edge
is attached to an occupied vertex of degree $k$. This event happens with probability $k p_k q_k / \langle k \rangle$. Hence,
we define the generating functions for both these probabilities, which are very important in site
percolation,

$$F_0(x; \{q\}) = \sum_{k=1}^{\infty} p_k q_k x^k , \qquad \text{(A.3a)}$$

$$F_1(x; \{q\}) = \frac{\sum_k k p_k q_k x^{k-1}}{\sum_k k p_k} . \qquad \text{(A.3b)}$$

The function $F_0(x; \{q\})$ is the generating function of the probability that a vertex of a given degree
exists and is occupied, while $F_1(x; \{q\})$ is the generating function for the probability of reaching an
occupied vertex of a given degree starting from a randomly chosen edge.

The solution of the site percolation problem is the set of values $\{q_k\}$ for which an infinite cluster
(giant component) exists. In order to compute the probability that a randomly chosen vertex belongs
to the giant component, we start by computing the probability $P_s$ that a randomly chosen vertex
belongs to a connected cluster of a certain size $s$. The use of generating functions allows us to do it
simultaneously for all the possible sizes. Then, the mean cluster size is obtained as the first derivative
of the generating function of $P_s$. Finally, the condition for the divergence of the mean cluster size
gives the condition for the existence of the giant component as a function of the parameters of the
system, namely the degree distribution and the node occupation probability.

Figure A.1: (A) A full dotted bullet with dashed contour line corresponds to the probability that a vertex
is unoccupied. This is given by a particular series of diagrams, in which we sum the contributions
of unoccupied vertices of all possible degrees. A striped bullet with full contour line represents the
generating function $F_0(x; \{q\})$, whose diagrammatical expansion contains all possible combinations
of occupied vertices reachable from an occupied vertex with a certain degree. The $x$ accounts for the
occupation of a vertex. The contributions of the isolated vertices ($p_0 q_0$) have been dropped in agreement
with our convention of considering only graphs with non-isolated vertices. (B) Diagrammatical
representation of the generating function $H_0(x)$. The first terms of the infinite series correspond to
the summation of Eqs. (A.5) as presented in Eq. (A.6). These contributions can be expressed in a
compact form using Eq. (A.7), which contains the generating functions $F_0(x; \{q\})$ and $H_1(x)$.



Firstly, we consider the probability $P_s$ that a randomly chosen vertex in the network belongs to a
connected cluster of a certain size $s$. We call $H_0(x; \{q\})$ its generating function,

$$H_0(x; \{q\}) = \sum_{s=0}^{\infty} P_s x^s , \qquad \text{(A.4)}$$

in which we have conventionally grouped in the term for $s = 0$ the probability $1 - \sum_k p_k q_k$ that a
vertex is not occupied. Similarly, let $\hat{P}_s$ be the probability that a randomly chosen edge leads to a
cluster of a given size $s$, and $H_1(x; \{q\})$ its generating function.

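Since we choose the starting vertex at random, each possible degree $k$ gives a different contribution
to each possible cluster size probability $P_s$, meaning that each term $P_s x^s$ is itself given by an infinite
sum of terms labeled by the degree. Writing $\hat{P}_s$ for the probability that a randomly chosen edge
leads to a cluster of size $s$ (generated by $H_1$), the first terms of $H_0(x; \{q\})$ are:

$$
\begin{aligned}
P_0 &= 1 - \sum_{k} p_k q_k , \\
P_1 x &= \sum_{l=1}^{\infty} p_l q_l \, x \, \hat{P}_0^{\,l} , \\
P_2 x^2 &= \sum_{l=1}^{\infty} l \, p_l q_l \, x \, (\hat{P}_1 x) \, \hat{P}_0^{\,l-1} , \\
P_3 x^3 &= \sum_{l=1}^{\infty} l \, p_l q_l \, x \, (\hat{P}_2 x^2) \, \hat{P}_0^{\,l-1} + \sum_{l=2}^{\infty} \frac{l(l-1)}{2} \, p_l q_l \, x \, (\hat{P}_1 x)^2 \, \hat{P}_0^{\,l-2} .
\end{aligned}
\qquad \text{(A.5)}
$$

Now, summing these terms and grouping similar contributions we get

$$H_0(x; \{q\}) = 1 - \sum_k p_k q_k + x \sum_{k=1}^{\infty} p_k q_k \left[ \hat{P}_0 + \hat{P}_1 x + \hat{P}_2 x^2 + \dots \right]^k . \qquad \text{(A.6)}$$

Note that no term contains $p_0 q_0$, in accordance with the convention of considering only non-isolated nodes.

A compact form for this expression is written using the function $F_0$ of Eq. (A.3a),

$$H_0(x; \{q\}) = 1 - F_0(1; \{q\}) + x F_0(H_1(x; \{q\}); \{q\}) . \qquad \text{(A.7)}$$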



The structure of Eq. (A.7) can be represented diagrammatically as shown in Fig. A.1, associating the
variable $x$ to each "bare" vertex and a variable $H_1(x; \{q\})$ to each "dressed" vertex, while the function
$F_0$ weights the contributions over all possible degrees.

Moreover, the generating function $H_1(x; \{q\})$ satisfies a similar self-consistent equation,

$$H_1(x; \{q\}) = 1 - F_1(1; \{q\}) + x F_1(H_1(x; \{q\}); \{q\}) , \qquad \text{(A.8)}$$

which is obtained following completely similar arguments, starting from picking an edge at random.

Taking the first derivative of $H_0$ in Eq. (A.7) with respect to $x$, computed at $x = 1$, we get the
mean cluster size $\langle s \rangle$. Then, imposing the divergence of the latter allows us to find the condition for
the existence of a giant component, which corresponds to the Molloy-Reed criterion as presented in
Ref. [65]. Finally, considering a uniform occupation probability $q_k = q$ (or uniform node traversing
probability), the expression for the site percolation threshold $q_c$ can be computed.
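
For the uniform case the steps above can be carried out explicitly. With $q_k = q$ one has $F_0 = q\,G_0$ and $F_1 = q\,G_1$, so differentiating Eqs. (A.7) and (A.8) at $x = 1$ gives $H_1'(1) = q / (1 - q\,G_1'(1))$ and $\langle s \rangle = q\,[1 + q\,G_0'(1) / (1 - q\,G_1'(1))]$, which diverges at $q_c = 1/G_1'(1) = \langle k \rangle / \langle k(k-1) \rangle$. The sketch below evaluates these two expressions under the assumptions of an uncorrelated graph and uniform occupation; the function names and the example degree distribution are hypothetical.

```python
def degree_moments(pk):
    """Return (<k>, <k(k-1)>) for a degree distribution given as {k: p_k}."""
    mean_k = sum(k * p for k, p in pk.items())
    mean_kk1 = sum(k * (k - 1) * p for k, p in pk.items())
    return mean_k, mean_kk1

def site_percolation_threshold(pk):
    """Uniform site percolation threshold q_c = <k> / <k(k-1)>."""
    mean_k, mean_kk1 = degree_moments(pk)
    return mean_k / mean_kk1

def mean_cluster_size(pk, q):
    """Mean cluster size below threshold: q * (1 + q G0'(1) / (1 - q G1'(1)))."""
    mean_k, mean_kk1 = degree_moments(pk)
    g1_prime = mean_kk1 / mean_k          # G1'(1)
    if q * g1_prime >= 1.0:
        raise ValueError("q is at or above the percolation threshold")
    return q * (1.0 + q * mean_k / (1.0 - q * g1_prime))

if __name__ == "__main__":
    pk = {3: 1.0}                         # hypothetical example: every vertex has degree 3
    print("q_c =", site_percolation_threshold(pk))
    for q in (0.1, 0.25, 0.4, 0.49):
        print(q, mean_cluster_size(pk, q))
```

For the regular example the threshold is 1/2, and the printed mean cluster size grows without bound as q approaches that value from below.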

Appendix B



A General Percolation Theory for 
Spreading Processes 

It is possible to develop a general theory for inhomogeneous joint site-bond percolation exploiting the
method of generating functions jHSl 11031 125()j , that is briefly introduced in Appendix A. The theory
holds for Markovian networks, i.e. random graphs with two-point degree correlations [43]; therefore
the edge transition probabilities depend at least on the degrees of the two extremities. Since real
spreading rates may depend on many other features, we account for these properties by introducing multi-type
nodes, and assigning to each edge transition probability a pair of additional labels indicating the
types of the nodes at the extremities.

The main result of this appendix is a general version of the Molloy-Reed criterion for the existence 
of a giant component in the case of joint site-bond percolation and, consequently, the expressions of 
the critical threshold for the two separate cases of site percolation and bond percolation. 

B.1 Markovian Networks with Multi-Type Nodes

As already mentioned in Chapter vertex-vertex correlations are usually expressed using the degree
conditional probability $p(k'|k)$, i.e. the probability that a vertex of degree $k$ is connected to a vertex
of degree $k'$. This has led to the definition of a class of correlated networks, called Markovian random
networks [43] and defined only by their degree distribution $p_k$ and by the degree conditional probability
$p(k'|k)$. The function $p(k'|k)$ satisfies the normalization constraint and a detailed balance condition
(see Chapter El for details). 

In this context, the edge transition probability must depend on the degrees $k_i$ and $k_j$ of the
extremities. Note that, while the analysis of the standard site (bond) percolation is based on the
relation between the degree distribution $p_k$ and the degree-dependent node occupation probability
$q_k$, in the inhomogeneous joint site-bond percolation the relation is between the pair of distributions
$\{p_k, p(k'|k)\}$ and the pair of probability functions $\{q_k, T_{kk'}\}$.

The multi-type classification of the nodes consists in dividing the nodes of a graph into $n$ classes,
each one specified by a particular quality or "type". The meaning is generic: it can be a distinctive
feature of the node (e.g. the gender or the age in a population of individuals) or it can be associated
with some quantity that has been measured on the network (e.g. the strength or the betweenness of the
node). Then, we consider the degree distribution $p_k^{(h)}$ of nodes of class $h = 1, \ldots, n$, conventionally
normalized on the relative set of nodes, i.e. $\sum_k p_k^{(h)} = 1$. This condition ensures the normalization to 1
for the generating functions. Inside the classes there are no restrictions on the transition probabilities
and they might be very different.

Summarizing, our approach considers a Markovian correlated graph with multi-type vertices, in which 
each vertex is given an occupation probability depending on its degree and type, and each edge is 
endowed with a transition probability depending on the degrees and the types of the extremities. 

The fundamental building block for the construction of generating functions in correlated graphs is the
rooted edge, composed of a starting vertex $i$ and the pending edge connecting it to a second
vertex $j$, without explicitly considering this second extremity. For this reason we will always average
over the degree of the second extremity of the edge. Let us consider a vertex $i$ chosen at random;
it will be characterized by a class $h$ and by a degree $k_i$. In principle, the $k_i$ edges departing from
that node are connected to $k_i$ other nodes belonging to different classes. Actually, only $m_i$ of them
are really reached by a flow because of the presence of transition probabilities on the edges (we call
such transmitting edges open). Therefore, they identify a partition $\{m_i^{(1)}, m_i^{(2)}, \ldots, m_i^{(n)}\}$, with
$\sum_{l} m_i^{(l)} = m_i \le k_i$, in which $m_i^{(l)}$ is the number of these neighbouring nodes belonging to the class $l$
and linked to $i$ by an open edge.

Suppose that $m_i^{(l)}$ of the $k_i$ edges emerging from a node of class $h$ and degree $k_i$ are successfully
connected to nodes of a same class $l$ and (possibly different) degrees $k_j$. The average probability
that an edge among them allows the flow to pass is $\sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i)$, where $p^{(h \to l)}(k_j|k_i)$ is
the degree conditional probability between vertices of states $h$ and $l$ and $T^{(h \to l)}_{k_i k_j}$ is the transition
probability along an edge from a node of degree $k_i$ in the class $h$ to a node of degree $k_j$ in the
class $l$. The origin of this term is simple: the probability to pass along the edge is the product of
the probabilities of two independent events, i.e. the edge exists and it is open; then, being interested in rooted edges,
we have to average over all possible degrees $k_j$. The probability that there are $m_i^{(l)}$ of these edges
produces a term $\left[ \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i) \right]^{m_i^{(l)}}$. Positive events give $n$ contributions of this kind, while
the $k_i - m_i$ negative events contribute a single term $\left[ 1 - \sum_{l=1}^{n} \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i) \right]^{k_i - m_i}$,
that is the probability that $k_i - m_i$ edges do not admit the flow's passage, whichever class they
belong to. Computing the probability of the whole event associated with the partition
$\{m_i^{(0)} = k_i - m_i, m_i^{(1)}, m_i^{(2)}, \ldots, m_i^{(n)}\}$ of the neighbours of the node with degree $k_i$, we get the multinomial
distribution

$$
P^{(h)}\big(k_i, \{m_i^{(l)}\}\big) = k_i! \,
\frac{\left[ 1 - \sum_{l=1}^{n} \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i) \right]^{m_i^{(0)}}}{m_i^{(0)}!}
\prod_{l=1}^{n} \frac{\left[ \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i) \right]^{m_i^{(l)}}}{m_i^{(l)}!} . \qquad \text{(B.1)}
$$

A simpler version of this multinomial distribution appears in Ref. [190].

The following step consists in using the expression of the multinomial distribution to obtain the
generating function of the probability that a physical quantity spreading from a vertex of class $h$
successfully flows through $\{m_i^{(l)}\}$ of its edges that point to vertices in the class $l$ ($l = 1, 2, \ldots, n$).
Summing over all possible values of $k_i$ and over all possible partitions of $k_i$ in $n + 1$ values $\{m_i^{(l)}\}$, we
obtain the generating function

$$
F_0^{(h)}(x_1, x_2, \ldots, x_n; \{q, T\}) = \sum_{k_i=1}^{\infty} p^{(h)}_{k_i} q^{(h)}_{k_i} \sum_{\{m_i^{(l)}\}} \delta\Big(k_i, \sum_{l=0}^{n} m_i^{(l)}\Big) \, P^{(h)}\big(k_i, \{m_i^{(l)}\}\big) \prod_{l=1}^{n} x_l^{m_i^{(l)}} , \qquad \text{(B.2)}
$$



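in which $q^{(h)}_{k_i}$ is the occupation (traversing) probability of a vertex belonging to the class $h$ with degree
$k_i$, $\delta(\cdot, \cdot)$ is a Kronecker symbol and the $x_1, \ldots, x_n$ variables represent the average contributions of
the rooted edges of the different classes. Inserting Eq. (B.1) into Eq. (B.2), the sum over the partitions
$\{m_i^{(l)}\}$ corresponds to the extended form of a multinomial term, providing the following expression
for the generating function

$$
F_0^{(h)}(x_1, x_2, \ldots, x_n; \{q, T\}) = F_0^{(h)}(\mathbf{x}; \{q, T\}) = \sum_{k_i=1}^{\infty} p^{(h)}_{k_i} q^{(h)}_{k_i} \left[ 1 + \sum_{l=1}^{n} (x_l - 1) \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i) \right]^{k_i} . \qquad \text{(B.3)}
$$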
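
The passage from the sum over partitions to the closed multinomial form can be verified numerically for a single vertex of fixed degree. In the sketch below the per-class transmission probabilities `probs[l]`, standing for the degree-averaged product of transition and conditional probabilities toward class l, are hypothetical input numbers, and the function names are introduced only for this check.

```python
import math
from itertools import product

def multinomial_probability(counts, probs):
    """Probability of a partition of k edges.

    counts[0] is the number of closed edges, counts[l] (l >= 1) the number of
    open edges toward class l; probs[l-1] is the probability that an edge is
    open and points to class l.
    """
    k = sum(counts)
    coeff = math.factorial(k)
    for m in counts:
        coeff //= math.factorial(m)        # multinomial coefficient, stays an integer
    value = coeff * (1.0 - sum(probs)) ** counts[0]
    for m, a in zip(counts[1:], probs):
        value *= a ** m
    return value

def partition_sum(k, probs, xs):
    """Sum over all partitions of k of the multinomial weight times prod_l x_l^m_l."""
    total = 0.0
    for open_counts in product(range(k + 1), repeat=len(probs)):
        if sum(open_counts) > k:
            continue
        counts = (k - sum(open_counts),) + open_counts
        weight = multinomial_probability(counts, probs)
        for m, x in zip(open_counts, xs):
            weight *= x ** m
        total += weight
    return total

def closed_form(k, probs, xs):
    """Closed multinomial expression [1 + sum_l (x_l - 1) a_l]^k."""
    return (1.0 + sum((x - 1.0) * a for a, x in zip(probs, xs))) ** k

if __name__ == "__main__":
    k, probs, xs = 4, [0.2, 0.3], [0.7, 0.4]   # illustrative values
    print(partition_sum(k, probs, xs), closed_form(k, probs, xs))
```

Setting every `x_l` to 1 makes both expressions equal to 1, which is the normalization of the multinomial distribution over all partitions.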



With a completely similar argument, we compute the generating function $F_1^{(h)}(\mathbf{x}; \{q, T\})$ of the probability
that a randomly chosen edge leads to a vertex of class $h$ from which the spread toward its
neighbours successfully flows through $\{m_i^{(l)}\}$ edges pointing to nodes of class $l$ ($l = 1, 2, \ldots, n$).
Hence, observing that now the number of emerging edges available to the spreading process reduces
to $k_i - 1$ and that the probability to reach the starting vertex (from an edge pointing to a generic
vertex of class $h$) is $k_i p^{(h)}_{k_i} / \sum_k k p^{(h)}_k$, the generating function $F_1^{(h)}(\mathbf{x}; \{q, T\})$ reads

$$
F_1^{(h)}(x_1, x_2, \ldots, x_n; \{q, T\}) = F_1^{(h)}(\mathbf{x}; \{q, T\}) = \sum_{k_i=1}^{\infty} \frac{k_i \, p^{(h)}_{k_i}}{\sum_k k \, p^{(h)}_k} \, q^{(h)}_{k_i} \left[ 1 + \sum_{l=1}^{n} (x_l - 1) \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i) \right]^{k_i - 1} . \qquad \text{(B.4)}
$$



As recalled in Appendix A, the two generating functions are useful in the computation of
a system of self-consistent equations (similar to those in Eqs. (A.7)-(A.8)) from which the expression
of the average cluster size $\langle s \rangle$ should be derived. The main difference concerns the form of the
generating functions $F_0^{(h)}(\mathbf{x}; \{q, T\})$ and $F_1^{(h)}(\mathbf{x}; \{q, T\})$, that are partitioned in classes (of nodes in
different states) and contain the contributions of the transition probabilities. Firstly, we consider the
probability that a randomly chosen edge leads to a vertex of class $h$ belonging to a connected
component of a given size $s$. Its generating function $H_1^{(h)}(x; \{q, T\})$ satisfies the self-consistent
equation

$$
H_1^{(h)}(x; \{q, T\}) = 1 - F_1^{(h)}(1; \{q, T\}) + x \, F_1^{(h)}\left[ H_1^{(1)}(x; \{q, T\}), \ldots, H_1^{(n)}(x; \{q, T\}); \{q, T\} \right] , \qquad \text{(B.5)}
$$

where the presence of $H_1^{(h)}(x; \{q, T\})$ for all $h = 1, \ldots, n$ on the r.h.s. means that the constraint on
the value of $h$ is required only on the starting node, not on the others reachable from it. Moreover, $x$
refers to the cluster distribution and does not need any label.

The first term in the r.h.s. of Eq. (B.5) is due to the probability that the node of class $h$ to which
a chosen edge leads is not occupied, therefore it should not depend on the transmissibility of any
outgoing edge. As required, the term $1 - F_1^{(h)}(1; \{q, T\})$ computed in $x = 1$ does not depend on
$\{T\}$. This corresponds exactly to the term $P_0$ in the cluster expansion. The second term of Eq. (B.5)
refers to the contribution of an occupied vertex. Let us suppose that its degree is $k_i$ and consider one
of its outgoing edges leading to a vertex in the class $l$: its contribution is given by the probability
$1 - \sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i)$ that the flow does not reach the second extremity of the edge, and by
the probability that it passes ($\sum_{k_j} T^{(h \to l)}_{k_i k_j} p^{(h \to l)}(k_j|k_i)$). The latter has to be multiplied by the vertex
function associated to the probability that this second vertex (of class $l$) belongs to a cluster of a
given size. This probability is generated by the function $H_1^{(l)}(x; \{q, T\})$, that is the correct vertex term
for this contribution. Then, all these quantities have to be averaged over the set of degrees $k_i$, with
weights $p^{(h)}_{k_i}$ and $q^{(h)}_{k_i}$, easily recovering the second term on the r.h.s. Since the spirit of the derivation
is completely the same, we refer again to Fig. A.1 for a diagrammatical representation of Eq. (B.5)
(details are different).

The other equation, for the generating function $H_0^{(h)}(x;\{q,T\})$ of the probability that a randomly chosen vertex of class $h$ belongs to a cluster of fixed size $s$, reads

$$ H_0^{(h)}(x;\{q,T\}) = 1 - F_0^{(h)}(\mathbf{1};\{q,T\}) + x\,F_0^{(h)}\left[H_1^{(1)}(x;\{q,T\}),\dots,H_1^{(n)}(x;\{q,T\});\{q,T\}\right]. \tag{B.6} $$

Note that, by definition, both $H_0^{(h)}(x;\{q,T\})$ and $H_1^{(h)}(x;\{q,T\})$ are 1 in $x = 1$ for all $h$.

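Now, taking the derivative of Eq. B.6 with respect to $x$ in $x = 1$, we obtain the average number of vertices reachable starting from a vertex in the class $h$,

$$ \langle s_h \rangle = \left.\frac{dH_0^{(h)}}{dx}\right|_{x=1} = F_0^{(h)}\left[\mathbf{H}_1(1) = \mathbf{1};\{q,T\}\right] + \sum_{l} \left.\frac{\partial F_0^{(h)}}{\partial x_l}\right|_{\mathbf{x}=\mathbf{1}} H_1^{(l)\prime}(1;\{q,T\}). \tag{B.7} $$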



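The second term in the r.h.s. contains linear contributions from other classes of vertices, therefore Eq. B.7 can be written in a matrix form (in the $n \times n$ product space generated by pairs of classes) as

$$ \langle \mathbf{s} \rangle = \nabla_x \mathbf{H}_0(x;\{q,T\})\big|_{x=1} = \mathbf{F}_0[\mathbf{1};\{q,T\}] + \nabla_{\mathbf{x}} \mathbf{F}_0[\mathbf{x};\{q,T\}]\big|_{\mathbf{x}=\mathbf{1}} \cdot \nabla_x \mathbf{H}_1(x;\{q,T\})\big|_{x=1}, \tag{B.8} $$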



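with $\mathbf{s} = (s_1, s_2, \dots, s_n)$. Taking the derivative of Eq. B.5 with respect to $x$ in $x = 1$, we obtain an implicit expression for $H_1^{(h)\prime}(1;\{q,T\})$,

$$ H_1^{(h)\prime}(1;\{q,T\}) = F_1^{(h)}[\mathbf{1};\{q,T\}] + \sum_{l} \frac{\partial}{\partial x_l} F_1^{(h)}[\mathbf{1};\{q,T\}]\; H_1^{(l)\prime}(1;\{q,T\}), \tag{B.9} $$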



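and putting together the contributions in a matrix formulation, we get

$$ \nabla_x \mathbf{H}_1(x;\{q,T\})\big|_{x=1} = \mathbf{F}_1[\mathbf{1};\{q,T\}] + \nabla_{\mathbf{x}} \mathbf{F}_1[\mathbf{x};\{q,T\}]\big|_{\mathbf{x}=\mathbf{1}} \cdot \nabla_x \mathbf{H}_1(x;\{q,T\})\big|_{x=1}. \tag{B.10} $$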



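Explicitly, Eq. B.10 becomes

$$ \nabla_x \mathbf{H}_1(x;\{q,T\})\big|_{x=1} = \left[I - \nabla_{\mathbf{x}} \mathbf{F}_1[\mathbf{x}=\mathbf{1};\{q,T\}]\right]^{-1} \cdot \mathbf{F}_1[\mathbf{1};\{q,T\}] = \left[I - \mathcal{F}\right]^{-1} \cdot \mathbf{F}_1[\mathbf{1};\{q,T\}], \tag{B.11} $$

with $\mathcal{F} = \nabla_{\mathbf{x}} \mathbf{F}_1[\mathbf{x}=\mathbf{1};\{q,T\}]$. Introducing this expression in Eq. B.8 we obtain

$$ \langle \mathbf{s} \rangle = \mathbf{F}_0[\mathbf{1};\{q,T\}] + \nabla_{\mathbf{x}} \mathbf{F}_0[\mathbf{x}=\mathbf{1};\{q,T\}] \cdot \left[I - \mathcal{F}\right]^{-1} \cdot \mathbf{F}_1[\mathbf{1};\{q,T\}]. \tag{B.12} $$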



The condition for a giant component to emerge consists in the divergence of the mean cluster size, i.e. of at least one of the components of $\langle \mathbf{s} \rangle$. It corresponds to the following generalized Molloy-Reed criterion for inhomogeneous joint site-bond percolation in multi-type Markovian correlated random graphs:

$$ \det\left[I - \mathcal{F}\right] \le 0, \tag{B.13} $$

where $\mathcal{F} = \nabla_{\mathbf{x}} \mathbf{F}_1[\mathbf{x}=\mathbf{1};\{q,T\}]$ (whose elements are of the type $\frac{\partial}{\partial x_l} F_1^{(h)}[\mathbf{1};\{q,T\}]$).

It is evident that such a general result strongly depends on which kind of node partition we are 
considering. Two examples of partitions are particularly relevant: i) a single class collecting all the 
nodes and ii) a degree-based classification of the nodes. 



B.2 Degree-based Multi-type Solution 



In this situation the types correspond to the different degrees, thus each class gathers all vertices with a given degree and there is in principle an infinite number $n$ of types (in finite networks there are as many types as the number of different degrees). Using the relations $p_k = 1$ and $k p_k / \sum_{k'} k' p_{k'} = 1$ (valid within each class), Eqs. B.3-B.4 become

$$ F_0^{(k_i)}(x_1,\dots,x_n;\{q,T\}) = q_{k_i} \left[1 + \sum_{k_j} (x_{k_j} - 1)\, T_{k_i k_j}\, p(k_j|k_i)\right]^{k_i}, \tag{B.14a} $$

$$ F_1^{(k_i)}(x_1,\dots,x_n;\{q,T\}) = q_{k_i} \left[1 + \sum_{k_j} (x_{k_j} - 1)\, T_{k_i k_j}\, p(k_j|k_i)\right]^{k_i - 1}. \tag{B.14b} $$



The self-consistent system of equations for the generating functions of the cluster size probability has the same form as Eqs. B.5-B.6, and the condition for the existence of a giant component is still the divergence of the mean cluster size, but now the elements of the matrix $\mathcal{F}$ are

$$ \mathcal{F}_{ij} = \left(\nabla_{\mathbf{x}} \mathbf{F}_1[\mathbf{x}=\mathbf{1};\{q,T\}]\right)_{ij} = (k_i - 1)\, q_{k_i} T_{k_i k_j}\, p(k_j|k_i). \tag{B.15} $$

The generalized Molloy-Reed criterion becomes

$$ \det\left[(k_i - 1)\, q_{k_i} T_{k_i k_j}\, p(k_j|k_i) - \delta_{ij}\right] \ge 0. \tag{B.16} $$



This expression corresponds to the criterion of Ref. [240] for the existence of percolation in correlated random graphs (apart from a matrix transposition), with the difference that in this case we are dealing with inhomogeneous joint site-bond percolation, so both the degree-dependent node traversing probability and the edge transition probability appear in the expression. The condition that the percolation threshold is related to the largest eigenvalue (see Ref. [240]) is recovered if we assume that all nodes have equal traversing probability $q_{k_i} = q = \mathrm{const}$. In other words, we are switching from a joint site-bond percolation to a simple site percolation with an occupation probability $q$. If $q \neq 0$ (otherwise the percolation condition cannot be satisfied), we can write the condition as

$$ q \det\left[(k_i - 1)\, T_{k_i k_j}\, p(k_j|k_i) - \Lambda\, \delta_{ij}\right] \ge 0, \tag{B.17} $$

with $\Lambda = 1/q$. Since in $q = 0$ the determinant is negative, the smallest positive value of $q$ for which Eq. B.17 is satisfied corresponds to the largest eigenvalue $\Lambda_{\max}$ of the matrix $(k_i - 1)\, T_{k_i k_j}\, p(k_j|k_i)$. It follows that the critical value of the site occupation probability is $q_c = 1/\Lambda_{\max}$. In the case $T_{k_i k_j} = 1$, the condition gives exactly the results obtained by Vazquez et al. [240].
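The eigenvalue condition can be illustrated numerically. The following is a minimal sketch, not the procedure used in Ref. [240]: it assumes that the degree classes, the transition probabilities $T_{k_i k_j}$ and the conditional probabilities $p(k_j|k_i)$ are available as plain Python lists, and it estimates the largest eigenvalue by power iteration (adequate for a non-negative, irreducible matrix). The function names and the four-degree example network are hypothetical.

```python
def branching_matrix(degrees, T, p_cond):
    """Entries (k_i - 1) * T[i][j] * p(k_j | k_i) of the matrix in Eq. B.17."""
    n = len(degrees)
    return [[(degrees[i] - 1) * T[i][j] * p_cond[i][j] for j in range(n)]
            for i in range(n)]


def largest_eigenvalue(M, max_iter=10_000, tol=1e-13):
    """Power iteration with L1 normalisation (M is assumed non-negative)."""
    n = len(M)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(max_iter):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = sum(w)  # equals the eigenvalue once v has converged
        if new_lam == 0.0:
            return 0.0
        v = [x / new_lam for x in w]
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


def critical_occupation(degrees, T, p_cond):
    """q_c = 1 / Lambda_max for a uniform node occupation probability."""
    return 1.0 / largest_eigenvalue(branching_matrix(degrees, T, p_cond))


# Hypothetical uncorrelated network: p(k_j | k_i) = k_j p_{k_j} / <k>, T = 1.
degrees = [1, 2, 3, 4]
pk = [0.4, 0.3, 0.2, 0.1]
mean_k = sum(k * p for k, p in zip(degrees, pk))
mean_k2 = sum(k * k * p for k, p in zip(degrees, pk))
p_cond = [[kj * pj / mean_k for kj, pj in zip(degrees, pk)] for _ in degrees]
T = [[1.0] * len(degrees) for _ in degrees]

q_c = critical_occupation(degrees, T, p_cond)
q_c_molloy_reed = mean_k / (mean_k2 - mean_k)
```

For this uncorrelated example with unit transition probabilities, `q_c` coincides with the standard Molloy-Reed value $\langle k \rangle / (\langle k^2 \rangle - \langle k \rangle)$ stored in `q_c_molloy_reed`.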

However, the complete knowledge of the correlation matrix $p(k_j|k_i)$ is very unlikely for real networks, and the analytical solution of Eq. B.16 can be problematic even for simple artificial networks.

B.3 Local Homogeneity Approximation

When all nodes belong to a unique class, the single-state network is recovered and the analytical treatment becomes easier. We recover indeed a sort of local homogeneity approximation for the probability that an edge emerging from a vertex of degree $k_i$ is traversed by the flow. When $n = 1$, the terms on the r.h.s. of Eqs. B.3-B.4 reduce to the average over the contributions of all the second extremities of the edges emerging from a node, yielding a degree-dependent effective average transmission coefficient $\tau_{k_i} = \sum_{k_j} T_{k_i k_j}\, p(k_j|k_i)$ ($0 \le \tau_{k_i} \le 1$). The advantage of this approximation is that it provides analytically solvable equations for the percolation condition. Moreover, even if the sum over $k_j$ introduces an approximation, the effective term does not neglect the contributions due to the edge transition probabilities, weighting them with the correct degree conditional probability as required for Markovian networks. If these contributions are similar the approximation is very good, while when the edge transition probabilities or the node correlations are highly heterogeneous the local effective medium approximation breaks down, and we have to use the general approach.



Derivation of the Molloy-Reed Criterion - According to this approximation, the generating function $F_0(x;\{q,T\})$ of the probability that a spreading process emerging from an occupied node flows through exactly $m$ nodes (whatever their degrees are) is written as

$$ F_0(x;\{q,T\}) = \sum_{k_i=1}^{\infty} p_{k_i} q_{k_i} \left[1 + (x - 1) \sum_{k_j=1}^{\infty} T_{k_i k_j}\, p(k_j|k_i)\right]^{k_i}. \tag{B.18} $$



This expression can be alternatively derived using simple arguments (see Ref. [84]). The value $F_0(1;\{q,T\})$ assumed by this function in $x = 1$ is the average occupation probability $\langle q \rangle = \sum_k p_k q_k$, which is consistent with the fact that summing over the contributions of all the possible numbers of emerging edges traversed by the flow means that we are simply considering the number of starting nodes, i.e. the average number of occupied nodes. The first derivative with respect to $x$ computed in $x = 1$,

$$ \left.\frac{d}{dx} F_0(x;\{q,T\})\right|_{x=1} = \sum_{k_i=1}^{\infty} k_i\, p_{k_i} q_{k_i} \sum_{k_j=1}^{\infty} T_{k_i k_j}\, p(k_j|k_i), \tag{B.19} $$

is the average number of open edges emerging from an occupied vertex.

Using similar arguments, we get the generating function $F_1(x;\{q,T\})$ of the probability that the flow spreading from a vertex, reached as an extremity of an edge picked up at random, passes through a given number of the remaining edges,

$$ F_1(x;\{q,T\}) = \sum_{k_i=1}^{\infty} \frac{k_i\, p_{k_i}}{\langle k \rangle}\, q_{k_i} \left[1 + (x - 1) \sum_{k_j=1}^{\infty} T_{k_i k_j}\, p(k_j|k_i)\right]^{k_i - 1}. \tag{B.20} $$

The next step consists in considering the self-consistent equations,

$$ H_0(x;\{q,T\}) = 1 - F_0(1;\{q,T\}) + x\,F_0\left[H_1(x;\{q,T\});\{q,T\}\right], \tag{B.21a} $$

$$ H_1(x;\{q,T\}) = 1 - F_1(1;\{q,T\}) + x\,F_1\left[H_1(x;\{q,T\});\{q,T\}\right]. \tag{B.21b} $$

Now the two quantities $H_0(x;\{q,T\})$ and $H_1(x;\{q,T\})$ are scalar, and represent the generating functions of the probability that a randomly chosen vertex (respectively, a vertex reached by a randomly chosen edge) belongs to a cluster of given size. Since using scalar quantities makes some passages easier to explain, we also derive the expression for the mean cluster size $\langle s \rangle$,

$$ \langle s \rangle = \left.\frac{d}{dx} H_0(x;\{q,T\})\right|_{x=1} = H_0'(1;\{q,T\}), \tag{B.22} $$



where $x = 1$ implies the average over all possible degrees $k_i$. From Eqs. B.21a-B.21b, using the relation $H_1(1;\{q,T\}) = 1$, we get

$$ \langle s \rangle = F_0(1;\{q,T\}) + F_0'(1;\{q,T\})\, H_1'(1;\{q,T\}), \tag{B.23} $$



where the expression for $H_1'(1;\{q,T\})$ comes directly from differentiating Eq. B.21b with respect to $x$ in $x = 1$,

$$ H_1'(1;\{q,T\}) = \frac{F_1(1;\{q,T\})}{1 - F_1'(1;\{q,T\})}. \tag{B.24} $$

Inserted into Eq. B.23, it leads to the well-known expression

$$ \langle s \rangle = F_0(1;\{q,T\}) + \frac{F_0'(1;\{q,T\})\, F_1(1;\{q,T\})}{1 - F_1'(1;\{q,T\})}; \tag{B.25} $$

hence, the giant component exists if and only if $F_1'(1;\{q,T\}) > 1$. Explicitly, it means that, in the local effective medium approximation, the generalized Molloy-Reed criterion for the inhomogeneous joint site-bond percolation on Markovian random graphs takes the following form,



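$$ \sum_{k_i=1}^{\infty} k_i\, p_{k_i} \left[(k_i - 1)\, q_{k_i} \sum_{k_j=1}^{\infty} T_{k_i k_j}\, p(k_j|k_i) - 1\right] > 0. \tag{B.26} $$

The threshold for the site percolation with inhomogeneous edges occurs when the equality holds. Inserting a uniform occupation probability $q_{k_i} = q$, its critical value is

$$ q_c = \frac{\langle k \rangle}{\sum_{k_i} k_i (k_i - 1)\, p_{k_i} \sum_{k_j} T_{k_i k_j}\, p(k_j|k_i)}. \tag{B.27} $$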
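The scalar quantities of the local homogeneity approximation are simple enough to be evaluated directly. The sketch below is only illustrative: it assumes a finite list of degrees, uses the effective transmission coefficient $\tau_{k_i} = \sum_{k_j} T_{k_i k_j}\, p(k_j|k_i)$, and evaluates the mean cluster size as $F_0(1) + F_0'(1) F_1(1) / (1 - F_1'(1))$ together with the critical uniform occupation probability. All function names and the example network are hypothetical.

```python
def effective_transmission(T, p_cond):
    """tau_{k_i} = sum_j T[i][j] * p(k_j | k_i)."""
    return [sum(t * p for t, p in zip(T_row, p_row))
            for T_row, p_row in zip(T, p_cond)]


def mean_cluster_size(degrees, pk, q, tau):
    """<s> below the threshold; None when F1'(1) >= 1 (giant component)."""
    mean_k = sum(k * p for k, p in zip(degrees, pk))
    F0 = sum(p * qk for p, qk in zip(pk, q))
    dF0 = sum(k * p * qk * t for k, p, qk, t in zip(degrees, pk, q, tau))
    F1 = sum(k * p * qk for k, p, qk in zip(degrees, pk, q)) / mean_k
    dF1 = sum(k * (k - 1) * p * qk * t
              for k, p, qk, t in zip(degrees, pk, q, tau)) / mean_k
    if dF1 >= 1.0:
        return None
    return F0 + dF0 * F1 / (1.0 - dF1)


def critical_uniform_occupation(degrees, pk, tau):
    """Critical q for uniform occupation q_k = q."""
    mean_k = sum(k * p for k, p in zip(degrees, pk))
    return mean_k / sum(k * (k - 1) * p * t
                        for k, p, t in zip(degrees, pk, tau))


# Hypothetical uncorrelated network with uniform transition probability 0.8.
degrees = [1, 2, 3, 4]
pk = [0.4, 0.3, 0.2, 0.1]
mean_k = sum(k * p for k, p in zip(degrees, pk))
p_cond = [[kj * pj / mean_k for kj, pj in zip(degrees, pk)] for _ in degrees]
T = [[0.8] * len(degrees) for _ in degrees]
tau = effective_transmission(T, p_cond)

q_c = critical_uniform_occupation(degrees, pk, tau)
s_below = mean_cluster_size(degrees, pk, [0.5] * len(degrees), tau)
s_above = mean_cluster_size(degrees, pk, [0.9] * len(degrees), tau)
```

Returning `None` above the threshold is a design choice of this sketch: there the expression for the mean size of finite clusters no longer applies, so no number is reported.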



Transition probability factorization - Note that all the expressions for the site-percolation threshold reported in Section 4.3.1 have been derived starting from Eq. B.26. In the following, we rapidly derive them, making explicit assumptions on the form of the edge transition probabilities. The most natural assumption consists of their factorization in two single-vertex contributions, i.e.

$$ T_{k_i k_j} = \Theta_i(k_i)\, \Theta_f(k_j), \tag{B.28} $$

where subscripts $i$ and $f$ indicate an initial and a final term respectively, in order to stress the fact that the first and second vertices of an edge can give different contributions in the inhomogeneous percolation process.

Inserting Eq. B.28 in Eq. B.26, the condition for the existence of a giant component becomes

$$ \sum_{k_i,k_j=1}^{\infty} k_i\, p_{k_i} \left[(k_i - 1)\, q_{k_i} \Theta_i(k_i)\, \Theta_f(k_j) - 1\right] p(k_j|k_i) > 0. \tag{B.29} $$

In the case of uncorrelated graphs, the conditional probability also factorizes, $p(k'|k) = k' p_{k'} / \langle k \rangle$, and Eq. B.29 gets simpler, leading to an interesting expression for the site percolation threshold,

$$ q_c = \frac{\langle k \rangle^2}{\left[\langle k^2 \Theta_i(k) \rangle - \langle k\, \Theta_i(k) \rangle\right] \langle k\, \Theta_f(k) \rangle}. \tag{B.30} $$



As an alternative, it seems interesting to study situations in which the transition probability is a function of only one of the two extremities of an edge. When it depends only on the final node, $\Theta_i(k) = \mathrm{const} = 1$ and $T_{k_i k_j} = \Theta_f(k_j)$, and the site percolation threshold takes the form

$$ q_c = \frac{\langle k \rangle^2}{\left[\langle k^2 \rangle - \langle k \rangle\right] \langle k\, \Theta_f(k) \rangle} = q_c^{hom}\, \frac{\langle k \rangle}{\langle k\, \Theta_f(k) \rangle}, \tag{B.31} $$

where $q_c^{hom}$ is the value of the critical occupation probability for the corresponding standard homogeneous site percolation. The opposite situation, $\Theta_f(k) = \mathrm{const} = 1$ and $\Theta_i(k) = T_k$, leads to the following form for the Molloy-Reed criterion

$$ \sum_{k=1}^{\infty} k \left[(k - 1)\, q_k T_k - 1\right] p_k > 0. \tag{B.32} $$

The inhomogeneous site percolation threshold $q_c$ follows directly by imposing a uniform occupation probability $q_k = q$,

$$ q_c = \frac{\langle k \rangle}{\langle k^2 T_k \rangle - \langle k T_k \rangle}. \tag{B.33} $$

From Eq. B.32 the threshold value for the inhomogeneous bond percolation is also immediately recovered. Indeed, if we suppose uniform transition probabilities $T_k = T$, we can solve Eq. B.32 for $T$ as a function of the set $\{q_k\}$ and compute the value of the threshold as

$$ T_c = \frac{\langle k \rangle}{\langle k^2 q_k \rangle - \langle k q_k \rangle}, \tag{B.34} $$

which represents the case in which percolation on the edges is affected by refractory nodes.

The last case we consider is that of uniform transition probabilities $T_{k_i k_j} = T$, with $0 < T < 1$. In the limit $T = 1$, the original Molloy-Reed criterion is recovered; otherwise, introducing $T_{k_i k_j} = T$ in Eq. B.27 we find the same expression for the threshold of site percolation with noisy edges as in the main text (with $\langle T \rangle = T$),

$$ q_c = \frac{1}{\left(\langle k^2 \rangle / \langle k \rangle - 1\right) T}. \tag{B.35} $$
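The special cases above can be cross-checked against each other numerically. The following sketch assumes an uncorrelated graph with a finite degree list and the factorized threshold $q_c = \langle k \rangle^2 / \big([\langle k^2 \Theta_i \rangle - \langle k \Theta_i \rangle] \langle k \Theta_f \rangle\big)$; the helper names and the example degree distribution are hypothetical.

```python
def average(f, degrees, pk):
    """<f(k)> over the degree distribution."""
    return sum(f(k) * p for k, p in zip(degrees, pk))


def qc_factorized(degrees, pk, theta_i, theta_f):
    """Site percolation threshold for T_{k k'} = theta_i(k) * theta_f(k')."""
    mean_k = average(lambda k: k, degrees, pk)
    initial = (average(lambda k: k * k * theta_i(k), degrees, pk)
               - average(lambda k: k * theta_i(k), degrees, pk))
    final = average(lambda k: k * theta_f(k), degrees, pk)
    return mean_k ** 2 / (initial * final)


def qc_uniform(degrees, pk, T):
    """Threshold for uniform transition probability T."""
    mean_k = average(lambda k: k, degrees, pk)
    mean_k2 = average(lambda k: k * k, degrees, pk)
    return 1.0 / ((mean_k2 / mean_k - 1.0) * T)


degrees = [1, 2, 3, 4]
pk = [0.4, 0.3, 0.2, 0.1]

# Uniform T written as a factorized probability: theta_i = T, theta_f = 1.
qc_a = qc_factorized(degrees, pk, lambda k: 0.8, lambda k: 1.0)
qc_b = qc_uniform(degrees, pk, 0.8)
# Homogeneous limit T = 1: the standard Molloy-Reed threshold.
qc_hom = qc_factorized(degrees, pk, lambda k: 1.0, lambda k: 1.0)
```

Writing the uniform case as a particular factorization is only a consistency check: both routes must give the same threshold, and the limit of unit transition probabilities must reduce to $\langle k \rangle / (\langle k^2 \rangle - \langle k \rangle)$.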
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Solution of the Naming Game on 
one-dimensional lattices 

In Section 5.3.1 we have discussed the behavior of the Naming Game model on low-dimensional regular lattices by means of numerical simulations. Local consensus is easy to reach, thus clusters composed of adjacent agents with the same unique word start to grow, developing a sort of effective surface tension at the interfaces. This effective surface tension, whose origins are partially investigated in Section 5.3.1, is responsible for the slow coarsening dynamics leading the system to the consensus state.

In one-dimensional lattices, it is possible to solve the dynamics almost exactly by mapping the Naming Game onto a problem of diffusion and coalescence of interfaces. Let us consider a single interface between two linear clusters of agents: in each cluster, all agents share the same unique word, say $A$ in the left-hand cluster and $B$ in the other. The interface is a string of length $m$ composed of sites in which both states $A$ and $B$ are present. We call $C_m$ this state $(A+B)^m$. A $C_0$ corresponds to two directly neighboring clusters ($\cdots AAABBB \cdots$), while $C_m$ means that the interface is composed of $m$ sites in the state $C = A + B$ ($\cdots AAAC \cdots CBBB \cdots$). Note that, in the actual dynamics, two clusters of states $A$ and $B$ can be separated by a more complex interface. For instance, a $C_m$ interface can break down into two (or more) smaller sets of $C$-states spaced out by $A$ or $B$ clusters, causing the number of interfaces to grow. Numerical investigation shows, however, that such configurations are eliminated in the early times of the dynamics. A sketch representing the evolution of the interface between two different clusters of words is reported in Fig. C.1.

Since the local dynamics is much faster than the global one, the probability that two neighboring clusters are separated by an interface of a given width is well approximated by the stationary solution of a Markov process governing the transitions between interfaces of different width. The Markov chain is easily generated by means of an iterative method.

Let us consider a one-dimensional line composed of $N$ sites, initially divided into two clusters of $A$ and $B$, with an interface of zero width. The probability to select for interaction exactly the link associated with this $C_0$ interface is $1/N$. Moreover, the outcome of such an interaction is the appearance of a $C_1$ interface, since one of the two agents (the one chosen as hearer) learns the word ($A$ or $B$) uttered by the other. Thus, there is a probability $p_{0,1} = 1/N$ that a $C_0$ interface becomes a $C_1$ interface in a single time step, otherwise it stays in $C_0$.

We can now compute all possible evolutions, and relative probabilities, for a $C_1$ interface. From $C_1$ the interface can evolve into a $C_0$ or a $C_2$ interface with probabilities $p_{1,0} = \frac{3}{2N}$ and $p_{1,2} = \frac{1}{2N}$ respectively. This procedure is easily extended to higher values of $m$, even if the number of possible cases grows considerably and the enumeration becomes harder. In principle there is no limit to the growth of the interface's width; in practice, we can safely truncate this enumeration at $m \le 3$, as suggested by numerical experiments in which we have counted the number of times an interface of a given width appears during the evolution of one-dimensional models.

Figure C.1: Sketch of the evolution of an interface in a one-dimensional Naming Game. At the interface between two clusters (red and blue sites), one or more sites can assume more than one word.
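The enumeration of the elementary transitions can be sketched in a few lines of code. The sketch below assumes the usual Naming Game rules (a speaker chosen uniformly among the $N$ sites, a hearer chosen among its two neighbours, a word chosen uniformly in the speaker's inventory, collapse of both inventories on success and learning by the hearer on failure); the function names are ours. Weights are expressed in units of $1/N$ and computed exactly with rational numbers.

```python
from fractions import Fraction


def interact(config, speaker, hearer, word):
    """Apply one Naming Game interaction and return the new configuration."""
    new = [set(inv) for inv in config]
    if word in new[hearer]:
        # Success: both agents keep only the uttered word.
        new[speaker] = {word}
        new[hearer] = {word}
    else:
        # Failure: the hearer adds the word to its inventory.
        new[hearer].add(word)
    return new


def width_transitions(config):
    """Weights (in units of 1/N) of the interface widths reached in one step.

    Only interior sites act as speakers, so `config` must be padded with one
    extra single-word site on each side, whose own interactions change nothing.
    """
    weights = {}
    for speaker in range(1, len(config) - 1):
        words = sorted(config[speaker])
        for hearer in (speaker - 1, speaker + 1):
            for word in words:
                # 1/2 for the choice of neighbour, 1/len(words) for the word.
                prob = Fraction(1, 2) * Fraction(1, len(words))
                new = interact(config, speaker, hearer, word)
                width = sum(1 for inv in new if len(inv) > 1)
                weights[width] = weights.get(width, Fraction(0)) + prob
    return weights


A, B, C = {"A"}, {"B"}, {"A", "B"}
from_C0 = width_transitions([A, A, B, B])     # interface of zero width
from_C1 = width_transitions([A, A, C, B, B])  # interface of width one
```

Here `from_C0[1]` is the weight of the transition $C_0 \to C_1$, while `from_C1[0]` and `from_C1[2]` are those of $C_1 \to C_0$ and $C_1 \to C_2$; multiplying by $1/N$ gives the corresponding probabilities per time step.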

In this approximation, the problem corresponds to determining the stationary probabilities of the truncated Markov chain reported in Fig. C.2 and defined by the transition matrix

in which the basis is $\{C_0, C_1, C_2, C_3\}$ and the contribution from $C_3$ to $C_4$ has been neglected (see Fig. C.2).

The stationary probability vector $\mathbf{P} = \{P_0, P_1, P_2, P_3\}$ is computed by imposing $\mathbf{P}(t+1) - \mathbf{P}(t) = 0$, i.e. $(\mathcal{M}^T - I)\mathbf{P} = 0$. This condition gives



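$$ P_0 = \frac{133}{227} \approx 0.586, \qquad P_1 = \frac{78}{227} \approx 0.344, \qquad P_2 = \frac{14}{227} \approx 0.062, \qquad P_3 = \frac{2}{227} \approx 0.0088. \tag{C.2} $$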

Direct numerical simulations of the evolution of a line $\cdots AAABBB \cdots$ (of the type reported in Fig. C.1) yield $P_0 \approx 0.581$, $P_1 \approx 0.344$, $P_2 \approx 0.063$, $P_3 \approx 0.01$, which are plotted in Fig. C.3 together with the theoretical prediction, confirming that our approximation is correct.
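The stationary condition is a small linear system and can be solved exactly. The following sketch shows one way to do it with rational arithmetic: one of the redundant equations of $(\mathcal{M}^T - I)\mathbf{P} = 0$ is replaced by the normalization $\sum_m P_m = 1$. The four-state matrix used here is a hypothetical placeholder, not the transition matrix of the interface dynamics, and the function name is ours.

```python
from fractions import Fraction


def stationary_distribution(M):
    """Solve P M = P with sum(P) = 1 exactly (M row-stochastic, irreducible)."""
    n = len(M)
    # Row i of A encodes sum_j (M[j][i] - delta_ij) P_j = 0.
    A = [[Fraction(M[j][i]) - (1 if i == j else 0) for j in range(n)]
         for i in range(n)]
    b = [Fraction(0)] * n
    # The equations are linearly dependent: replace the last by normalization.
    A[-1] = [Fraction(1)] * n
    b[-1] = Fraction(1)
    # Gauss-Jordan elimination with a search for a non-zero pivot.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] *= inv
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return b


F = Fraction
# Hypothetical truncated chain on four widths (each row sums to one).
M = [
    [F(3, 4), F(1, 4), F(0), F(0)],
    [F(3, 8), F(1, 2), F(1, 8), F(0)],
    [F(1, 8), F(1, 4), F(1, 2), F(1, 8)],
    [F(0), F(1, 8), F(3, 8), F(1, 2)],
]
P = stationary_distribution(M)
# Stationarity check: P M must reproduce P.
PM = [sum(P[j] * M[j][i] for j in range(4)) for i in range(4)]
```

Working with `Fraction` keeps the result in the same exact rational form as the stationary probabilities quoted in the text, at the price of being practical only for small chains.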

Having shown that the width of the interfaces remains small, we assume that they are point-like objects localized around their central position $x$. For instance, in the previously mentioned example with $A$ and $B$ clusters, we denote by $x_l$ the position of the right-most site of cluster $A$ and by $x_r$ the position of the left-most site of cluster $B$: the position of the interface is $x = \frac{x_l + x_r}{2}$. An interaction involving sites at the interface, i.e. an interface transition $C_m \to C_{m'}$, corresponds to a set of possible movements for the central position $x$. Our purpose is to write down a master equation for the movement of the interfaces.

Figure C.2: Truncated Markov process associated with interface width dynamics - schematic evolution of a $C_0$ interface $\cdots AAABBB \cdots$, cut at the maximal width $m = 3$.

The transition rates $W(x \to x \pm \delta)$ of an interface (its center) from a position $x$ to a position $x \pm \delta$ are obtained by enumeration of all possible cases. Using the above approximation only three symmetric contributions are present, corresponding to displacements of $\pm 1/2$, $\pm 1$ and $\pm 3/2$. These transition probabilities receive contributions from interfaces with different width. For instance, the Markov matrix in Eq. C.1 says that the $C_1$ interface $AACBB$ evolves into the following configurations

in which we have also reported the corresponding displacement of the center's position $x$. Computing all possible transitions and grouping them according to the displacement $\delta$, we obtain

Using the expressions for the stationary probability $P_0$, .

Figure C.3: Numerical values of the probability of the interface's width (circles) and corresponding theoretical predictions (crosses) obtained as solution of the truncated Markov chain. The two sets of values are in excellent agreement.

The knowledge of these transition probabilities allows us to write the master equation for the probability $\mathcal{P}(x,t)$ to find the interface in position $x$ at time $t$. In the limit of continuous time and space, expanding the single terms

the master equation reduces to a standard diffusion equation

$$ \frac{\partial \mathcal{P}(x,t)}{\partial t} = \frac{D}{N}\, \frac{\partial^2 \mathcal{P}(x,t)}{\partial x^2}, $$

where $D = 401/1816 \simeq 0.221$ is the theoretical value of the diffusion coefficient (in the appropriate dimensional units $(\delta x)^2/\delta t$).
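The passage from a master equation with symmetric jumps to a diffusion equation can be made explicit with a second-order expansion. The block below is a generic sketch of this step, not a reproduction of the original expansion: it assumes that the jump probabilities can be written as $W(x \to x \pm \delta) = w_\delta / N$, where the symbols $w_\delta$ are our notation for the grouped contributions of each displacement.

```latex
% Assumed notation: W(x -> x +/- delta) = w_delta / N, delta in {1/2, 1, 3/2}.
\begin{align}
\mathcal{P}(x,t+1) - \mathcal{P}(x,t)
  &= \sum_{\delta} \frac{w_\delta}{N}
     \left[ \mathcal{P}(x+\delta,t) + \mathcal{P}(x-\delta,t) - 2\,\mathcal{P}(x,t) \right] , \\
\mathcal{P}(x\pm\delta,t)
  &\simeq \mathcal{P}(x,t) \pm \delta\, \frac{\partial \mathcal{P}}{\partial x}
     + \frac{\delta^2}{2}\, \frac{\partial^2 \mathcal{P}}{\partial x^2} , \\
\frac{\partial \mathcal{P}(x,t)}{\partial t}
  &\simeq \frac{1}{N} \Big( \sum_{\delta} w_\delta\, \delta^2 \Big)
     \frac{\partial^2 \mathcal{P}(x,t)}{\partial x^2} .
\end{align}
```

Under this assumption the first-order terms cancel by symmetry and the diffusion coefficient is the second moment of the jump distribution, $\sum_\delta w_\delta\, \delta^2$, with an overall factor $1/N$ coming from the probability of selecting a site of the interface.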

These results are confirmed by numerical simulations, as illustrated in Section 5.3.1 by Fig. 5.7, where the numerical probability $\mathcal{P}(x,t)$ is reported. The latter is a Gaussian centered around the initial position, with diffusion coefficient $D_{exp} \simeq 0.224 \approx D$.

The diffusion of interfaces and the consequent coarsening dynamics governing cluster growth are the causes of the slow convergence of the one-dimensional Naming Game model, which reaches the consensus absorbing state in $t_{conv} \sim O(N^3)$ steps ($O(N^2)$ if we rescale the time step by $N$ as in statistical mechanics models).



Appendix D 



Master equation approach to agents 
internal dynamics 

In this appendix, we discuss a more rigorous derivation of the probability distribution $\mathcal{P}_n(k|t)$ that an agent of degree $k$ has $n$ words stored in its inventory (at time $t$) (see also Ref. [22]). Formally, the master equation associated to the jump process observed in the numerical simulations for the number of words (opinions, etc.) $n$ in the inventory of a fixed agent of degree $k$ can be written as

$$ \begin{aligned} \mathcal{P}_n(k,t+1) - \mathcal{P}_n(k,t) &= W_k(n-1 \to n|t)\, \mathcal{P}_{n-1}(k,t) - W_k(n \to n+1|t)\, \mathcal{P}_n(k,t) \\ &\quad - W_k(n \to 1|t)\, \mathcal{P}_n(k,t), \qquad N_d(t) \ge n > 1, \\ \mathcal{P}_1(k,t+1) - \mathcal{P}_1(k,t) &= \sum_{j=2}^{N_d(t)} W_k(j \to 1|t)\, \mathcal{P}_j(k,t) - W_k(1 \to 2|t)\, \mathcal{P}_1(k,t), \end{aligned} \tag{D.1} $$

where $N_d(t)$ is the maximum number of different words at time $t$ and $\mathcal{P}_n(k,t+1)$ depends a priori explicitly on time.

In order to get an expression for the transition rates, we call $C_t(k)$ the number of different words that are accessible to a node (of degree $k$) at time $t$, i.e. that are present in the neighborhood of the node. This number is not known, so we consider it as a parameter to be estimated a posteriori. Nevertheless, the mean-field type of dynamics that characterizes all small-world topologies ensures that the quantity $C_t(k)$ depends mildly on $k$, so that in the following analysis we will often consider it as a function of time only, $C_t$. In small-world topologies, indeed, there is an initial spreading of words throughout the network that destroys local correlations. The case of low-dimensional lattices is different since states can spread only locally, causing strong correlations between the inventories. We also define the inverse quantity, $p_t(k) = 1/C_t(k)$, which will be used in the following formulae for notational convenience.

The probability of a successful negotiation is $n\, p_t(k)$, where $n$ is the number of words in the inventory of a hearer of degree $k$. Considering both the probabilities of the agent playing as hearer and speaker, the transition rate $W_k(n \to 1|t)$ reads

$$ W_k(n \to 1|t) \simeq p_k \langle p_t \rangle \langle n \rangle + q_k\, p_t(k)\, n. \tag{D.2} $$

Analogously, using the failure probability $1 - n\, p_t(k)$,

$$ W_k(n \to n+1|t) \simeq p_k \left(1 - \langle p_t \rangle \langle n \rangle\right) + q_k \left(1 - p_t(k)\, n\right), \tag{D.3} $$

where the average terms $\langle p_t \rangle$ and $\langle n \rangle$ come from an uncorrelated mean-field hypothesis for the neighboring sites of the node playing as speaker.
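The structure of the two rates is easy to check with a short numerical sketch. The functions below simply transcribe Eqs. D.2 and D.3; the parameter values (network size, degree, number of accessible words, average inventory size) are hypothetical and only serve to show that success and failure rates add up to the total probability $p_k + q_k$ that the node takes part in an interaction.

```python
def success_rate(n, p_k, q_k, p_t_k, mean_p_t, mean_n):
    """W_k(n -> 1 | t): speaker term plus hearer term (Eq. D.2)."""
    return p_k * mean_p_t * mean_n + q_k * p_t_k * n


def failure_rate(n, p_k, q_k, p_t_k, mean_p_t, mean_n):
    """W_k(n -> n+1 | t): speaker term plus hearer term (Eq. D.3)."""
    return p_k * (1.0 - mean_p_t * mean_n) + q_k * (1.0 - p_t_k * n)


# Hypothetical parameters: N nodes, a node of degree k in a graph of mean degree 10.
N, k, mean_k = 1000, 20, 10.0
p_k = 1.0 / N                 # probability of being chosen as speaker
q_k = k * p_k / mean_k        # probability of being chosen as hearer
p_t_k = mean_p_t = 1.0 / 200  # inverse number of accessible words
mean_n = 5.0

n = 8
w_success = success_rate(n, p_k, q_k, p_t_k, mean_p_t, mean_n)
w_failure = failure_rate(n, p_k, q_k, p_t_k, mean_p_t, mean_n)
```

The hearer part of the success rate grows linearly with `n`, while the speaker part does not depend on the inventory of the node itself.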
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Figure D.l: Parametric dependence on time of the distribution of the number of words: the time has 
the effect of deforming the shape of the distribution, but does not change the functional form. (Top) 
BA graph with N = 10 4 nodes with (k) = 10. Only the set of nodes with large k (hubs) is monitored. 
Histograms come from measurements at different times t\ and £2 with t% — t\ — 5 ■ 10° time steps. 
(Bottom) ER graph of A = 10 nodes and (fc) = 10. Measures refer to a set of high-degree nodes 
with ti — t\ = 5 • 10 5 time steps. 



An important remark concerns the temporal evolution of the master equation. From Eqs. D.2-D.3 we note that the timescale of the internal dynamics is fast, of order $1/p_k \propto N$. On the other hand, the only time-dependent parameter is $p_t(k)$ (or $C_t(k)$), which depends on the global behavior of the system, whose dynamics is slower. Unfortunately, we do not know the exact behavior of this global quantity, thus the only possible approach consists of considering it as an external modulation to the internal dynamics. According to this argument, we study the stationary condition of the master equation (Eq. D.1) and compute the solution $\mathcal{P}_n(k|t)$, in which we assume a parametric dependence of the distribution on time, governing its relaxation dynamics. We have verified this assumption by measuring the distribution at different times. As displayed in Fig. D.1, the solution depends on time, but the dependence has the only effect of deforming the shape during the evolution, without changing the functional form.
Plugging the expressions of the transition rates into the stationary master equation, we get the following recursion relation,

$$ \mathcal{P}_n(k|t) = \frac{q_k \left[1 - p_t(k)(n-1)\right]}{q_k \left[1 - n\, p_t(k)\right] + q_k\, p_t(k)\, n + p_k \langle p_t \rangle \langle n \rangle}\; \mathcal{P}_{n-1}(k|t). \tag{D.4} $$

Then, introducing $q_k = k p_k / \langle k \rangle = b(k)\, p_k$ and replacing $p_t(k)(n-1)$ with $p_t(k)\, n$ (which is justified for $n \gg 1$), Eq. D.4 can be rewritten as

$$ \mathcal{P}_n(k|t) \simeq \frac{b(k) \left[1 - p_t(k)\, n\right]}{b(k) + \langle p_t \rangle \langle n \rangle}\; \mathcal{P}_{n-1}(k|t). \tag{D.5} $$

Figure D.2: Probability of $W_k(n \to 1|t)$ and $W_k(n \to n+1|t)$ for BA and ER models, both with $N = 5000$ nodes, for $k \simeq 200$ (for a BA with $\langle k \rangle = 10$) and $k \simeq 70$ (for an ER with $\langle k \rangle = 50$).



Since $n\, p_t(k) \ll 1$, we can write $1 - p_t(k)\, n \simeq e^{-p_t(k)\, n}$, thus solving the recurrence relation,

n n-l 



7> n (fc|i) 



6(fc) 



b(k) + (p t )(n) 



(D.6) 



The normalization relation gives the constant $\mathcal{P}_1(k|t)$. The controlling parameter of the curve is $\beta_t(k) = b(k) / (b(k) + \langle p_t \rangle \langle n \rangle)$, which tunes the decay of the distribution between an exponential and a Gaussian-like tail.

In the next two paragraphs, we focus on these two cases separately, showing that the two different results derive from different forms of the transition rates, which can also be inferred from simulations.


The case of homogeneous networks - Exploiting the remark coming from simulations, in homogeneous networks we can safely put $q_k = b\, p_k$, with $b \sim O(1)$, and $p_t(k) \simeq \langle p_t \rangle$ (i.e. $C_t(k) \simeq C_t$), since the nodes are almost equivalent. The number of states is also almost the same for every node, which implies $n \simeq \langle n \rangle$. The approximated expressions of the success and failure probabilities are

$$ W_k(n \to 1|t) \approx p_k \langle n \rangle (1 + b)/C_t \approx 2\, p_k \langle n \rangle / C_t, \tag{D.7} $$

$$ W_k(n \to n+1|t) \approx p_k \left(1 - \langle p_t \rangle \langle n \rangle\right)(1 + b) \approx 2\, p_k. \tag{D.8} $$

We have verified numerically the validity of these approximated expressions for the quantities $W_k(n \to n+1|t)$ and $W_k(n \to 1|t)$ (just the term for nodes playing as hearers, since the term for speakers is almost constant). The data reported in Fig. D.2 (right) show that for an ER model with $N = 5 \cdot 10^3$ nodes and $\langle k \rangle = 50$, both quantities are almost constant with respect to $n$.
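The equilibrium condition for the master equation becomes

$$ \begin{aligned} 0 &= 2\, p_k\, \mathcal{P}_{n-1}(k|t) - 2\, p_k\, \mathcal{P}_n(k|t) - 2\, p_k \frac{\langle n \rangle}{C_t}\, \mathcal{P}_n(k|t), \qquad n > 1, \\ 0 &= \sum_{j=2}^{\infty} 2\, \frac{\langle n \rangle}{C_t}\, p_k\, \mathcal{P}_j(k|t) - 2\, p_k\, \mathcal{P}_1(k|t). \end{aligned} \tag{D.9} $$

We can neglect the dependence on $k$, since $p_k$ cancels out, in agreement with the approximation done for the homogeneity of the network. The solution by recursion is very simple,

$$ \mathcal{P}_n(t) \approx (1 - \beta_t)\, \beta_t^{\,n-1}, \qquad \beta_t = \frac{1}{1 + \langle n \rangle / C_t}. \tag{D.10} $$

Using the expansion of the logarithm $\log(1 - \epsilon_t) \simeq -\epsilon_t$, with $\epsilon_t = 1 - \beta_t \simeq \langle n \rangle / C_t$, the previous formula gives the following exponential decay for the distribution of the number of words,

$$ \mathcal{P}_n(t) \simeq \frac{\langle n \rangle}{C_t}\; e^{-\frac{\langle n \rangle}{C_t}\, n}. \tag{D.11} $$

The exponential decay is in agreement with the numerical data, providing a value $\simeq 0.16$ for $\langle n \rangle / C_t$.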
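The geometric solution and its exponential approximation can be compared directly. The sketch below assumes the stationary recursion $\mathcal{P}_n (1 + \langle n \rangle / C_t) = \mathcal{P}_{n-1}$ and uses the value $\langle n \rangle / C_t = 0.16$ quoted above; the function names and the cutoff on $n$ are our choices.

```python
import math


def geometric_distribution(ratio, n_max):
    """P_n = (1 - beta) * beta**(n - 1), with beta = 1 / (1 + ratio)."""
    beta = 1.0 / (1.0 + ratio)
    return [(1.0 - beta) * beta ** (n - 1) for n in range(1, n_max + 1)]


def exponential_approximation(ratio, n_max):
    """P_n ~ ratio * exp(-ratio * n), valid for small ratio = <n> / C_t."""
    return [ratio * math.exp(-ratio * n) for n in range(1, n_max + 1)]


ratio = 0.16  # value of <n> / C_t quoted in the text
exact = geometric_distribution(ratio, 400)
approx = exponential_approximation(ratio, 400)

# Decay rates per unit n of the two expressions.
exact_rate = -math.log(exact[1] / exact[0])    # log(1 + ratio)
approx_rate = -math.log(approx[1] / approx[0])  # ratio
```

The two decay rates differ by terms of second order in $\langle n \rangle / C_t$, which is the accuracy of the expansion of the logarithm used above.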

High-degree nodes in heterogeneous networks - The other important case is that of the hubs in heterogeneous networks. In a direct NG a hub is preferentially chosen as hearer, by a factor $k/\langle k \rangle \gg 1$, so we can neglect the first terms in both success and failure probability. We consider the following approximated expressions

$$ W_k(n \to 1|t) \simeq q_k\, p_t(k)\, n, \tag{D.12} $$

$$ W_k(n \to n+1|t) \simeq q_k \left(1 - p_t(k)\, n\right) \simeq q_k, \tag{D.13} $$

in which the last approximation is justified by the fact that the number of words accessible to a hub is large, thus $p_t(k)$ is expected to be small. For heterogeneous networks, the numerical data in Fig. D.2 (left) for $W_k(n \to 1|t)$ show a clear linear growth of the quantity with $n$, in agreement with Eq. D.2, while the almost constant behavior of $W_k(n \to n+1|t)$ with $n$ can be fitted with an expression of the form of Eq. D.3 only for very small values of the product $p_t(k)\, n$. On the other hand, Fig. D.2 (bottom) points out that in the case of homogeneous networks both quantities are almost constant with respect to $n$. We now show that these different behaviors of the transition rates are responsible for the different shape of the probability distribution $\mathcal{P}_n(k|t)$.

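We can easily compute the quasi-stationary distribution $\{\mathcal{P}_n(k|t)\}$ from the first equation in Eq. D.1,

$$ 0 = q_k\, \mathcal{P}_{n-1}(k|t) - q_k\, \mathcal{P}_n(k|t) - \frac{q_k\, n}{C_t(k)}\, \mathcal{P}_n(k|t), \tag{D.14} $$

and we find recursively

$$ \mathcal{P}_n(k|t) = \frac{C_t(k)}{C_t(k) + n}\, \mathcal{P}_{n-1}(k|t) = \frac{C_t(k)^2}{(C_t(k) + n)(C_t(k) + n - 1)}\, \mathcal{P}_{n-2}(k|t). \tag{D.15} $$

Now, from the closure relation $\sum_{n=1}^{\infty} \mathcal{P}_n(k|t) = 1$ we get the expression of $\mathcal{P}_1(k|t)$, and the final form for $\mathcal{P}_n(k|t)$ becomes

$$ \mathcal{P}_n(k|t) = \frac{\Gamma(C_t(k) + 1)}{\Gamma(C_t(k) + n + 1)}\; C_t(k)^{\,n-1} \left[ \frac{C_t(k)^{\,C_t(k)+1}\, e^{-C_t(k)}}{\gamma(C_t(k) + 1,\, C_t(k))} \right], \tag{D.17} $$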

where $\gamma(a, x)$ is the lower incomplete Gamma function. The functional form of the quasi-stationary distribution is complicated, but exploiting some approximations for Gamma functions, it is possible to write it in a simpler form (see Ref. for details),

$$ \mathcal{P}_n(k|t) \simeq \sqrt{\frac{2}{\pi\, C_t(k)}}\; e^{-\frac{n^2}{2\, C_t(k)}}. \tag{D.18} $$

This expression, corresponding to a Half-Normal distribution, is in very good agreement with the result of simulations. It is remarkable that fitting numerical results in Fig. 5.29 (top) with this expression gives a consistent value for $C_t(k)$. Indeed, from simulations the constant prefactor is $\simeq 0.0046$ and $C_t(k) \simeq 357$, which plugged into $\sqrt{2/(\pi\, C_t(k))}$ gives the very close value 0.0042.
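The quality of the Half-Normal approximation can be checked by iterating the recursion numerically. The following sketch assumes the recursion $\mathcal{P}_n = \frac{C_t(k)}{C_t(k) + n} \mathcal{P}_{n-1}$, normalizes the result by direct summation instead of using Gamma functions, and compares it with $\sqrt{2/(\pi C_t(k))}\, e^{-n^2/(2 C_t(k))}$. The value of $C_t(k)$ used here is an arbitrary illustrative choice, and the function names are ours.

```python
import math


def hub_distribution(C, n_max):
    """Normalised solution of P_n = C / (C + n) * P_{n-1} for 1 <= n <= n_max."""
    weights = [1.0]  # unnormalised P_1
    for n in range(2, n_max + 1):
        weights.append(weights[-1] * C / (C + n))
    total = sum(weights)
    return [w / total for w in weights]


def half_normal(C, n):
    """Half-Normal approximation with variance C."""
    return math.sqrt(2.0 / (math.pi * C)) * math.exp(-n * n / (2.0 * C))


C = 400.0  # illustrative number of accessible words
P = hub_distribution(C, 2000)

# Relative deviation of the approximation at n = 1 and around n = sqrt(C).
dev_first = abs(P[0] - half_normal(C, 1)) / half_normal(C, 1)
dev_bulk = abs(P[19] - half_normal(C, 20)) / half_normal(C, 20)
```

The cutoff `n_max` only needs to be much larger than $\sqrt{C_t(k)}$, since the weights decay as a Gaussian; the deviations measured by `dev_first` and `dev_bulk` shrink as $C_t(k)$ grows.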
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